Preface


Most statistical work is concerned directly with the provision and implementa- 
tion of methods for study design and for the analysis and interpretation of data. 
The theory of statistics deals in principle with the general concepts underlying 
all aspects of such work and from this perspective the formal theory of statistical 
inference is but a part of that full theory. Indeed, from the viewpoint of indi- 
vidual applications, it may seem rather a small part. Concern is likely to be more 
concentrated on whether models have been reasonably formulated to address 
the most fruitful questions, on whether the data are subject to unappreciated 
errors or contamination and, especially, on the subject-matter interpretation of 
the analysis and its relation with other knowledge of the field. 

Yet the formal theory is important for a number of reasons. Without some 
systematic structure statistical methods for the analysis of data become a col- 
lection of tricks that are hard to assimilate and interrelate to one another, or 
for that matter to teach. The development of new methods appropriate for new 
problems would become entirely a matter of ad hoc ingenuity. Of course such 
ingenuity is not to be undervalued and indeed one role of theory is to assimilate, 
generalize and perhaps modify and improve the fruits of such ingenuity. 

Much of the theory is concerned with indicating the uncertainty involved in 
the conclusions of statistical analyses, and with assessing the relative merits of 
different methods of analysis, and it is important even at a very applied level to 
have some understanding of the strengths and limitations of such discussions. 
This is linked to somewhat more philosophical issues concerning
the nature of probability. A final reason, and a very good one, for study of the
theory is that it is interesting. 

The object of the present book is to set out as compactly as possible the 
key ideas of the subject, in particular aiming to describe and compare the main 
ideas and controversies over more foundational issues that have rumbled on at 
varying levels of intensity for more than 200 years. I have tried to describe the 
various approaches in a dispassionate way but have added an appendix with a
more personal assessment of the merits of different ideas. 

Some previous knowledge of statistics is assumed and preferably some 
understanding of the role of statistical methods in applications; the latter 
understanding is important because many of the considerations involved are 
essentially conceptual rather than mathematical and relevant experience is 
necessary to appreciate what is involved. 

The mathematical level has been kept as elementary as is feasible and is 
mostly that, for example, of a university undergraduate education in mathem- 
atics or, for example, physics or engineering or one of the more quantitative 
biological sciences. Further, as I think is appropriate for an introductory discus- 
sion of an essentially applied field, the mathematical style used here eschews 
specification of regularity conditions and theorem-proof style developments. 
Readers primarily interested in the qualitative concepts rather than their devel- 
opment should not spend too long on the more mathematical parts of the 
book. 

The discussion is implicitly strongly motivated by the demands of applic- 
ations, and indeed it can be claimed that virtually everything in the book has 
fruitful application somewhere across the many fields of study to which stat- 
istical ideas are applied. Nevertheless I have not included specific illustrations. 
This is partly to keep the book reasonably short, but, more importantly, to focus 
the discussion on general concepts without the distracting detail of specific 
applications, details which, however, are likely to be crucial for any kind of 
realism. 

The subject has an enormous literature and to avoid overburdening the reader 
I have given, by notes at the end of each chapter, only a limited number of key 
references based on an admittedly selective judgement. Some of the references 
are intended to give an introduction to recent work whereas others point towards 
the history of a theme; sometimes early papers remain a useful introduction to 
a topic, especially to those that have become suffocated with detail. A brief 
historical perspective is given as an appendix. 

The book is a much expanded version of lectures given to doctoral students of 
the Institute of Mathematics, Chalmers/Gothenburg University, and I am very 
grateful to Peter Jagers and Nanny Wermuth for their invitation and encourage- 
ment. It is a pleasure to thank Ruth Keogh, Nancy Reid and Rolf Sundberg for 
their very thoughtful, detailed and constructive comments and advice on a
preliminary version. It is a pleasure to thank also Anthony Edwards and Deborah
Mayo for advice on more specific points. I am solely responsible for errors of 
fact and judgement that remain. 




The book is in broadly three parts. The first three chapters are largely intro- 
ductory, setting out the formulation of problems, outlining in a simple case 
the nature of frequentist and Bayesian analyses, and describing some special 
models of theoretical and practical importance. The discussion continues with 
the key ideas of likelihood, sufficiency and exponential families. 

Chapter 4 develops some slightly more complicated applications. The long 
Chapter 5 is more conceptual, dealing, in particular, with the various meanings 
of probability as it is used in discussions of statistical inference. Most of the key 
concepts are in these chapters; the remaining chapters, especially Chapters 7 
and 8, are more specialized. 

Especially in the frequentist approach, many problems of realistic complexity 
require approximate methods based on asymptotic theory for their resolution 
and Chapter 6 sets out the main ideas. Chapters 7 and 8 discuss various com- 
plications and developments that are needed from time to time in applications. 
Chapter 9 deals with something almost completely different, the possibil- 
ity of inference based not on a probability model for the data but rather on 
randomization used in the design of the experiment or sampling procedure. 

I have written and talked about these issues for more years than it is com- 
fortable to recall and am grateful to all with whom I have discussed the topics, 
especially, perhaps, to those with whom I disagree. I am grateful particularly 
to David Hinkley with whom I wrote an account of the subject 30 years ago. 
The emphasis in the present book is less on detail and more on concepts but the 
eclectic position of the earlier book has been kept. 

I appreciate greatly the care devoted to this book by Diana Gillooly, Com- 
missioning Editor, and Emma Pearce, Production Editor, Cambridge University 
Press. 
Preliminaries


Summary. Key ideas about probability models and the objectives of statist- 
ical analysis are introduced. The differences between frequentist and Bayesian 
analyses are illustrated in a very special case. Some slightly more complicated 
models are introduced as reference points for the following discussion. 


1.1 Starting point 


We typically start with a subject-matter question. Data are or become available 
to address this question. After preliminary screening, checks of data quality and 
simple tabulations and graphs, more formal analysis starts with a provisional 
model. The data are typically split in two parts (y : z), where y is regarded as the 
observed value of a vector random variable Y and z is treated as fixed. Sometimes
the components of y are direct measurements of relevant properties on study 
individuals and sometimes they are themselves the outcome of some preliminary 
analysis, such as means, measures of variability, regression coefficients and so 
on. The set of variables z typically specifies aspects of the system under study 
that are best treated as purely explanatory and whose observed values are not 
usefully represented by random variables. That is, we are interested solely in the 
distribution of outcome or response variables conditionally on the variables z; a 
particular example is where z represents treatments in a randomized experiment.

We use throughout the notation that observable random variables are rep- 
resented by capital letters and observations by the corresponding lower case 
letters. 

A model, or strictly a family of models, specifies the density of Y to be 


$$f_Y(y : z; \theta), \qquad (1.1)$$

where $\theta \in \Omega_\theta$ is unknown. The distribution may depend also on design features
of the study that generated the data. We typically simplify the notation to
$f_Y(y; \theta)$, although the explanatory variables z are frequently essential in specific
applications. 

To choose the model appropriately is crucial to fruitful application. 

We follow the very convenient, although deplorable, practice of using the term 
density both for continuous random variables and for the probability function 
of discrete random variables. The deplorability comes from the functions being 
dimensionally different, probabilities per unit of measurement in continuous 
problems and pure numbers in discrete problems. In line with this convention 
in what follows integrals are to be interpreted as sums where necessary. Thus 
we write 


$$E(Y) = E(Y; \theta) = \int y f_Y(y; \theta)\,dy \qquad (1.2)$$


for the expectation of Y, showing the dependence on $\theta$ only when relevant. The
integral is interpreted as a sum over the points of support in a purely discrete case.
Next, for each aspect of the research question we partition $\theta$ as $(\psi, \lambda)$, where $\psi$
is called the parameter of interest and $\lambda$ is included to complete the specification
and is commonly called a nuisance parameter. Usually, but not necessarily, $\psi$ and
$\lambda$ are variation independent in that $\Omega_\theta$ is the Cartesian product $\Omega_\psi \times \Omega_\lambda$. That
is, any value of $\psi$ may occur in connection with any value of $\lambda$. The choice of
$\psi$ is a subject-matter question. In many applications it is best to arrange that $\psi$
is a scalar parameter, i.e., to break the research question of interest into simple 
components corresponding to strongly focused and incisive research questions, 
but this is not necessary for the theoretical discussion. 

It is often helpful to distinguish between the primary features of a model 
and the secondary features. If the former are changed the research questions of 
interest have either been changed or at least formulated in an importantly differ- 
ent way, whereas if the secondary features are changed the research questions 
are essentially unaltered. This does not mean that the secondary features are 
unimportant but rather that their influence is typically on the method of estima- 
tion to be used and on the assessment of precision, whereas misformulation of 
the primary features leads to the wrong question being addressed. 

We concentrate on problems where $\Omega_\theta$ is a subset of $\mathbb{R}^d$, i.e., d-dimensional
real space. These are so-called fully parametric problems. Other possibilities 
are to have semiparametric problems or fully nonparametric problems. These 
typically involve fewer assumptions of structure and distributional form but 
usually contain strong assumptions about independencies. To an appreciable 
extent the formal theory of semiparametric models aims to parallel that of
parametric models. 

The probability model and the choice of $\psi$ serve to translate a subject-matter
question into a mathematical and statistical one and clearly the faithfulness of 
the translation is crucial. To check on the appropriateness of a new type of model 
to represent a data-generating process it is sometimes helpful to consider how 
the model could be used to generate synthetic data. This is especially the case 
for stochastic process models. Understanding of new or unfamiliar models can 
be obtained both by mathematical analysis and by simulation, exploiting the 
power of modern computational techniques to assess the kind of data generated 
by a specific kind of model. 
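
As a minimal sketch of generating synthetic data from a model, the following uses the normal-theory linear regression model that appears as Example 1.2 below. The function name and all numerical values are illustrative assumptions, not taken from the text.

```python
import random


def simulate_linear_regression(z, alpha, beta, sigma, seed=None):
    """Draw one synthetic data set y_k = alpha + beta * z_k + error,
    with independent normal errors of standard deviation sigma."""
    rng = random.Random(seed)
    return [alpha + beta * zk + rng.gauss(0.0, sigma) for zk in z]


# Hypothetical explanatory values and parameters, chosen only for illustration.
z = [0.0, 1.0, 2.0, 3.0, 4.0]
for rep in range(3):
    y = simulate_linear_regression(z, alpha=1.0, beta=0.5, sigma=0.3, seed=rep)
    print([round(v, 2) for v in y])
```

Looking at a few such replicates gives a rough impression of whether data of the kind observed could plausibly have arisen from the model.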


1.2 Role of formal theory of inference 


The formal theory of inference initially takes the family of models as given and 
the objective as being to answer questions about the model in the light of the 
data. Choice of the family of models is, as already remarked, obviously crucial 
but outside the scope of the present discussion. More than one choice may be 
needed to answer different questions. 

A second and complementary phase of the theory concerns what is sometimes 
called model criticism, addressing whether the data suggest minor or major 
modification of the model or in extreme cases whether the whole focus of 
the analysis should be changed. While model criticism is often done rather 
informally in practice, it is important for any formal theory of inference that it 
embraces the issues involved in such checking. 


1.3 Some simple models 


General notation is often not best suited to special cases and so we use more 
conventional notation where appropriate. 


Example 1.1. The normal mean. Whenever it is required to illustrate some
point in simplest form it is almost inevitable to return to the most hackneyed
of examples, which is therefore given first. Suppose that $Y_1, \ldots, Y_n$ are
independently normally distributed with unknown mean $\mu$ and known variance $\sigma_0^2$.
Here $\mu$ plays the role of the unknown parameter $\theta$ in the general formulation.
In one of many possible generalizations, the variance $\sigma^2$ also is unknown. The
parameter vector is then $(\mu, \sigma^2)$. The component of interest $\psi$ would often be $\mu$
but could be, for example, $\sigma^2$ or $\mu/\sigma$, depending on the focus of subject-matter
interest.


Example 1.2. Linear regression. Here the data are n pairs $(y_1, z_1), \ldots, (y_n, z_n)$
and the model is that $Y_1, \ldots, Y_n$ are independently normally distributed with
variance $\sigma^2$ and with


$$E(Y_k) = \alpha + \beta z_k. \qquad (1.3)$$


Here typically, but not necessarily, the parameter of interest is $\psi = \beta$ and the
nuisance parameter is $\lambda = (\alpha, \sigma^2)$. Other possible parameters of interest include
the intercept at z = 0, namely $\alpha$, and $-\alpha/\beta$, the intercept of the regression line
on the z-axis.


Example 1.3. Linear regression in semiparametric form. In Example 1.2 
replace the assumption of normality by an assumption that the $Y_k$ are uncorrelated
with constant variance. This is semiparametric in that the systematic part
of the variation, the linear dependence on $z_k$, is specified parametrically and the
random part is specified only via its covariance matrix, leaving the functional 
form of its distribution open. A complementary form would leave the system- 
atic part of the variation a largely arbitrary function and specify the distribution 
of error parametrically, possibly of the same normal form as in Example 1.2. 
This would lead to a discussion of smoothing techniques. 


Example 1.4. Linear model. We have an $n \times 1$ vector Y and an $n \times q$ matrix z
of fixed constants such that


$$E(Y) = z\beta, \quad \mathrm{cov}(Y) = \sigma^2 I, \qquad (1.4)$$


where $\beta$ is a $q \times 1$ vector of unknown parameters, $I$ is the $n \times n$ identity
matrix and with, in the analogue of Example 1.2, the components independently
normally distributed. Here z is, in initial discussion at least, assumed of full
rank q < n. A relatively simple but important generalization has $\mathrm{cov}(Y) = \sigma^2 V$,
where V is a given positive definite matrix. There is a corresponding
semiparametric version generalizing Example 1.3.

Both Examples 1.1 and 1.2 are special cases, in the former the matrix z 
consisting of a column of 1s. 


Example 1.5. Normal-theory nonlinear regression. Of the many generaliza- 
tions of Examples 1.2 and 1.4, one important possibility is that the dependence 
on the parameters specifying the systematic part of the structure is nonlinear. 
For example, instead of the linear regression of Example 1.2 we might wish to 
consider 


$$E(Y_k) = \alpha + \beta \exp(\gamma z_k), \qquad (1.5)$$

where from the viewpoint of statistical theory the important nonlinearity is not
in the dependence on the variable z but rather that on the parameter $\gamma$.
More generally the equation $E(Y) = z\beta$ in (1.4) may be replaced by


$$E(Y) = \mu(\beta), \qquad (1.6)$$


where the $n \times 1$ vector $\mu(\beta)$ is in general a nonlinear function of the unknown
parameter $\beta$ and also of the explanatory variables.


Example 1.6. Exponential distribution. Here the data are $(y_1, \ldots, y_n)$ and the
model takes $Y_1, \ldots, Y_n$ to be independently exponentially distributed with density
$\rho e^{-\rho y}$, for y > 0, where $\rho > 0$ is an unknown rate parameter. Note that
possible parameters of interest are $\rho$, $\log \rho$ and $1/\rho$ and the issue will arise of
possible invariance or equivariance of the inference under reparameterization,
i.e., shifts from, say, $\rho$ to $1/\rho$. The observations might be intervals between
successive points in a Poisson process of rate $\rho$. The interpretation of $1/\rho$ is
then as a mean interval between successive points in the Poisson process. The
use of $\log \rho$ would be natural were $\rho$ to be decomposed into a product of effects
of different explanatory variables and in particular if the ratio of two rates were
of interest.
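
The reparameterization issue can be made concrete with a small sketch. It assumes, beyond what this section states, that the rate is estimated by n divided by the sum of the observations (the value maximizing the exponential likelihood); the function names are hypothetical. The estimates of the rate, its logarithm and the mean interval are then transforms of one another, which is the equivariance referred to above.

```python
import math
import random


def simulate_intervals(rho, n, seed=None):
    """Intervals between successive points of a Poisson process of rate rho."""
    rng = random.Random(seed)
    return [rng.expovariate(rho) for _ in range(n)]


def rate_estimate(y):
    """n / sum(y), which maximizes rho**n * exp(-rho * sum(y)) over rho > 0."""
    return len(y) / sum(y)


def reparameterized_estimates(y):
    """Estimates of rho, log(rho) and 1/rho obtained by transforming one estimate."""
    rho_hat = rate_estimate(y)
    return {
        "rho": rho_hat,
        "log_rho": math.log(rho_hat),
        "mean_interval": 1.0 / rho_hat,  # equals the sample mean of y
    }


y = simulate_intervals(rho=2.0, n=50, seed=1)
print(reparameterized_estimates(y))
```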


Example 1.7. Comparison of binomial probabilities. Suppose that the data are
$(r_0, n_0)$ and $(r_1, n_1)$, where $r_k$ denotes the number of successes in $n_k$ binary trials
under condition k. The simplest model is that the trials are mutually independent
with probabilities of success $\pi_0$ and $\pi_1$. Then the random variables $R_0$ and $R_1$
have independent binomial distributions. We want to compare the probabilities
and for this may take various forms for the parameter of interest, for example


$$\psi = \log\{\pi_1/(1 - \pi_1)\} - \log\{\pi_0/(1 - \pi_0)\}, \quad \text{or} \quad \psi = \pi_1 - \pi_0, \qquad (1.7)$$


and so on. For many purposes it is immaterial how we define the complementary
parameter $\lambda$. Interest in the nonlinear function $\log\{\pi/(1 - \pi)\}$ of a probability
$\pi$ stems partly from the interpretation as a log odds, partly because it maps the
parameter space (0, 1) onto the real line and partly from the simplicity of some
resulting mathematical models of more complicated dependences, for example
on a number of explanatory variables.
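
A short sketch of the two forms of the parameter of interest in (1.7) follows. The function names and the probabilities used are hypothetical; the point is only that pairs of probabilities with the same difference can have quite different log odds differences.

```python
import math


def log_odds(p):
    """log{p / (1 - p)}, mapping the interval (0, 1) onto the real line."""
    return math.log(p / (1.0 - p))


def comparison_parameters(pi0, pi1):
    """The two candidate parameters of interest in (1.7)."""
    return {
        "log_odds_difference": log_odds(pi1) - log_odds(pi0),
        "difference": pi1 - pi0,
    }


# Two hypothetical pairs with nearly the same difference of probabilities.
print(comparison_parameters(0.50, 0.60))
print(comparison_parameters(0.85, 0.95))
```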


Example 1.8. Location and related problems. A different generalization of
Example 1.1 is to suppose that $Y_1, \ldots, Y_n$ are independently distributed all with
the density $g(y - \mu)$, where $g(y)$ is a given probability density. We call $\mu$
a location parameter; often it may by convention be taken to be the mean or
median of the density.

A further generalization is to densities of the form $\tau^{-1} g\{(y - \mu)/\tau\}$, where $\tau$
is a positive parameter called a scale parameter and the family of distributions
is called a location and scale family.

Central to the general discussion of such models is the notion of a family of
transformations of the underlying random variable and the parameters. In the
location and scale family if $Y_k$ is transformed to $aY_k + b$, where a > 0 and b are
arbitrary, then the new random variable has a distribution of the original form
with transformed parameter values


$$a\mu + b, \quad a\tau. \qquad (1.8)$$


The implication for most purposes is that any method of analysis should obey
the same transformation properties. That is, if the limits of uncertainty for say
$\mu$, based on the original data, are centred on $\bar{y}$, then the limits of uncertainty for
the corresponding parameter after transformation are centred on $a\bar{y} + b$.
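
The transformation property can be checked numerically. The sketch below assumes, purely for illustration, limits of the form sample mean plus or minus a fixed multiple of the sample standard deviation; the multiplier and the data are hypothetical and no particular method of the text is implied.

```python
import statistics


def limits(y, half_width_factor=2.0):
    """Hypothetical limits: mean -/+ factor * sample standard deviation."""
    centre = statistics.mean(y)
    spread = statistics.stdev(y)
    return (centre - half_width_factor * spread,
            centre + half_width_factor * spread)


def transform(y, a, b):
    """Apply y -> a * y + b to every observation (a > 0)."""
    return [a * v + b for v in y]


# Illustrative temperatures in degrees Celsius, converted to Fahrenheit.
y = [12.1, 14.3, 13.8, 15.0, 12.9]
a, b = 1.8, 32.0
lower, upper = limits(y)
print(limits(transform(y, a, b)))
print((a * lower + b, a * upper + b))
```

The two printed pairs agree up to rounding error: transforming the data and then computing limits gives the same answer as computing limits and then transforming them.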

Typically this represents, in particular, the notion that conclusions should not 
depend on the units of measurement. Of course, some care is needed with this 
idea. If the observations are temperatures, for some purposes arbitrary changes 
of scale and location, i.e., of the nominal zero of temperature, are allowable, 
whereas for others recognition of the absolute zero of temperature is essential. 
In the latter case only transformations from kelvins to some multiple of kelvins 
would be acceptable. 

It is sometimes important to distinguish invariance that springs from some 
subject-matter convention, such as the choice of units of measurement, from
invariance arising out of some mathematical formalism. 


The idea underlying the above example can be expressed in much more gen- 
eral form involving two groups of transformations, one on the sample space 
and one on the parameter space. Data recorded as directions of vectors on a 
circle or sphere provide one example. Another example is that some of the 
techniques of normal-theory multivariate analysis are invariant under arbitrary 
nonsingular linear transformations of the observed vector, whereas other meth- 
ods, notably principal component analysis, are invariant only under orthogonal 
transformations. 

The object of the study of a theory of statistical inference is to provide a 
set of ideas that deal systematically with the above relatively simple situations 
and, more importantly still, enable us to deal with new models that arise in new 
applications. 
1.4 Formulation of objectives


We can, as already noted, formulate possible objectives in two parts as follows. 
Part I takes the family of models as given and aims to: 


• give intervals or in general sets of values within which $\psi$ is in some sense
likely to lie;

• assess the consistency of the data with a particular parameter value $\psi_0$;

• predict as yet unobserved random variables from the same random system
that generated the data;

• use the data to choose one of a given set of decisions D, requiring the
specification of the consequences of various decisions.


Part II uses the data to examine the family of models via a process of model 
criticism. We return to this issue in Section 3.2. 

We shall concentrate in this book largely but not entirely on the first two 
of the objectives in Part I, interval estimation and measuring consistency with 
specified values of $\psi$.

To an appreciable extent the theory of inference is concerned with general- 
izing to a wide class of models two approaches to these issues which will be 
outlined in the next section and with a critical assessment of these approaches. 


1.5 Two broad approaches to statistical inference 


1.5.1 General remarks 


Consider the first objective above, that of providing intervals or sets of values 
likely in some sense to contain the parameter of interest, $\psi$.

There are two broad approaches, called frequentist and Bayesian, respect- 
ively, both with variants. Alternatively the former approach may be said to be 
based on sampling theory and an older term for the latter is that it uses inverse 
probability. Much of the rest of the book is concerned with the similarities 
and differences between these two approaches. As a prelude to the general 
development we show a very simple example of the arguments involved. 

We take for illustration Example 1.1, which concerns a normal distribution 
with unknown mean $\mu$ and known variance. In the formulation probability is
used to model variability as experienced in the phenomenon under study and 
its meaning is as a long-run frequency in repetitions, possibly, or indeed often, 
hypothetical, of that phenomenon. 

What can reasonably be said about $\mu$ on the basis of observations $y_1, \ldots, y_n$
and the assumptions about the model? 
1.5.2 Frequentist discussion


In the first approach we make no further probabilistic assumptions. In particular
we treat $\mu$ as an unknown constant. Strong arguments can be produced
for reducing the data to their mean $\bar{y} = \sum y_k/n$, which is the observed value
of the corresponding random variable $\bar{Y}$. This random variable has under the
assumptions of the model a normal distribution of mean $\mu$ and variance $\sigma_0^2/n$,
so that in particular


$$P(\bar{Y} > \mu - k_c^* \sigma_0/\sqrt{n}) = 1 - c, \qquad (1.9)$$


where, with $\Phi(.)$ denoting the standard normal integral, $\Phi(k_c^*) = 1 - c$. For
example with c = 0.025, $k_c^* = 1.96$. For a sketch of the proof, see Note 1.5.
Thus the statement equivalent to (1.9) that


$$P(\mu < \bar{Y} + k_c^* \sigma_0/\sqrt{n}) = 1 - c, \qquad (1.10)$$


can be interpreted as specifying a hypothetical long run of statements about $\mu$
a proportion $1 - c$ of which are correct. We have observed the value $\bar{y}$ of the
random variable $\bar{Y}$ and the statement


$$\mu < \bar{y} + k_c^* \sigma_0/\sqrt{n} \qquad (1.11)$$


is thus one of this long run of statements, a specified proportion of which are
correct. In the most direct formulation of this $\mu$ is fixed and the statements vary
and this distinguishes the statement from a probability distribution for $\mu$. In fact
a similar interpretation holds if the repetitions concern an arbitrary sequence of
fixed values of the mean.
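
The long-run interpretation can be illustrated by simulation. The sketch below is only an illustration under assumed values of the mean, standard deviation and sample size; the function names are hypothetical. It repeatedly generates samples, forms the upper limit of (1.11), and records the proportion of statements that are correct, which should be close to 1 - c.

```python
import math
import random
from statistics import NormalDist


def upper_limit(y, sigma0, c):
    """Upper limit ybar + k * sigma0 / sqrt(n), where Phi(k) = 1 - c."""
    k = NormalDist().inv_cdf(1.0 - c)
    return sum(y) / len(y) + k * sigma0 / math.sqrt(len(y))


def coverage(mu, sigma0, n, c, repetitions, seed=0):
    """Proportion of simulated statements 'mu < upper limit' that are correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        y = [rng.gauss(mu, sigma0) for _ in range(n)]
        if mu < upper_limit(y, sigma0, c):
            hits += 1
    return hits / repetitions


# Hypothetical true mean and known standard deviation.
print(coverage(mu=3.0, sigma0=2.0, n=10, c=0.025, repetitions=20000))
```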

There are a large number of generalizations of this result, many underpinning
standard elementary statistical techniques. For instance, if the variance $\sigma^2$ is
unknown and estimated by $\sum (y_k - \bar{y})^2/(n - 1)$ in (1.9), then $k_c^*$ is replaced
by the corresponding point in the Student t distribution with $n - 1$ degrees of
freedom.

There is no need to restrict the analysis to a single level c and provided 
concordant procedures are used at the different c a formal distribution is built up. 

Arguments involving probability only via its (hypothetical) long-run fre- 
quency interpretation are called frequentist. That is, we define procedures for 
assessing evidence that are calibrated by how they would perform were they 
used repeatedly. In that sense they do not differ from other measuring instru- 
ments. We intend, of course, that this long-run behaviour is some assurance that 
with our particular data currently under analysis sound conclusions are drawn. 
This raises important issues of ensuring, as far as is feasible, the relevance of 
the long run to the specific instance. 
1.5.3 Bayesian discussion


In the second approach to the problem we treat $\mu$ as having a probability distribution
both with and without the data. This raises two questions: what is the
meaning of probability in such a context, some extended or modified notion of 
probability usually being involved, and how do we obtain numerical values for 
the relevant probabilities? This is discussed further later, especially in Chapter 5. 
For the moment we assume some such notion of probability concerned with 
measuring uncertainty is available. 

If indeed we can treat $\mu$ as the realized but unobserved value of a random
variable M, all is in principle straightforward. By Bayes' theorem, i.e., by simple
laws of probability,


$$f_{M|Y}(\mu \mid y) = f_{Y|M}(y \mid \mu) f_M(\mu) \Big/ \int f_{Y|M}(y \mid \phi) f_M(\phi)\,d\phi. \qquad (1.12)$$


The left-hand side is called the posterior density of M and of the two terms in the
numerator the first is determined by the model and the other, $f_M(\mu)$, forms the
prior distribution summarizing information about M not arising from y. Any
method of inference treating the unknown parameter as having a probability 
distribution is called Bayesian or, in an older terminology, an argument of 
inverse probability. The latter name arises from the inversion of the order of 
target and conditioning events as between the model and the posterior density. 

The intuitive idea is that in such cases all relevant information about $\mu$ is
then contained in the conditional distribution of the parameter given the data, 
that this is determined by the elementary formulae of probability theory and 
that remaining problems are solely computational. 

In our example suppose that the prior for $\mu$ is normal with known mean m
and variance v. Then the posterior density for $\mu$ is proportional to


$$\exp\{-\sum (y_k - \mu)^2/(2\sigma_0^2) - (\mu - m)^2/(2v)\} \qquad (1.13)$$


considered as a function of $\mu$. On completing the square as a function of $\mu$,
there results a normal distribution of mean and variance respectively


$$\frac{\bar{y}/(\sigma_0^2/n) + m/v}{1/(\sigma_0^2/n) + 1/v}, \qquad (1.14)$$

$$\frac{1}{1/(\sigma_0^2/n) + 1/v}; \qquad (1.15)$$


for more details of the argument, see Note 1.5. Thus an upper limit for $\mu$ satisfied
with posterior probability $1 - c$ is


$$\frac{\bar{y}/(\sigma_0^2/n) + m/v}{1/(\sigma_0^2/n) + 1/v} + k_c^* \sqrt{\frac{1}{1/(\sigma_0^2/n) + 1/v}}. \qquad (1.16)$$
If v is large compared with $\sigma_0^2/n$ and m is not very different from $\bar{y}$ these
limits agree closely with those obtained by the frequentist method. If there is a
serious discrepancy between $\bar{y}$ and m this indicates either a flaw in the data or
a misspecification of the prior distribution.
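
The agreement between the two limits for large v can be seen numerically. The following sketch computes the posterior mean and variance and the two upper limits; the function names and all numerical values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def posterior(ybar, n, sigma0, m, v):
    """Posterior mean and variance of mu for a normal prior with mean m, variance v."""
    data_precision = 1.0 / (sigma0 ** 2 / n)
    prior_precision = 1.0 / v
    variance = 1.0 / (data_precision + prior_precision)
    mean = (ybar * data_precision + m * prior_precision) * variance
    return mean, variance


def bayes_upper_limit(ybar, n, sigma0, m, v, c):
    """Upper limit for mu holding with posterior probability 1 - c."""
    mean, variance = posterior(ybar, n, sigma0, m, v)
    return mean + NormalDist().inv_cdf(1.0 - c) * math.sqrt(variance)


def frequentist_upper_limit(ybar, n, sigma0, c):
    """Upper limit ybar + k * sigma0 / sqrt(n) with Phi(k) = 1 - c."""
    return ybar + NormalDist().inv_cdf(1.0 - c) * sigma0 / math.sqrt(n)


# Hypothetical summary values; the prior variance v increases down the output.
for v in (0.1, 1.0, 100.0, 1e6):
    print(v, bayes_upper_limit(1.0, 25, 2.0, 0.5, v, 0.025),
          frequentist_upper_limit(1.0, 25, 2.0, 0.025))
```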

This broad parallel between the two types of analysis is in no way specific 
to the normal distribution. 


1.6 Some further discussion 


We now give some more detailed discussion especially of Example 1.4 and 
outline a number of special models that illustrate important issues. 

The linear model of Example 1.4 and methods of analysis of it stemming 
from the method of least squares are of much direct importance and also are the 
base of many generalizations. The central results can be expressed in matrix 
form centring on the least squares estimating equations 


$$z^T z \hat{\beta} = z^T Y, \qquad (1.17)$$

the vector of fitted values

$$\hat{Y} = z\hat{\beta}, \qquad (1.18)$$

and the residual sum of squares

$$\mathrm{RSS} = (Y - \hat{Y})^T (Y - \hat{Y}) = Y^T Y - \hat{\beta}^T (z^T z) \hat{\beta}. \qquad (1.19)$$


Insight into the form of these results is obtained by noting that were it not
for random error the vector Y would lie in the space spanned by the columns
of z, that $\hat{Y}$ is the orthogonal projection of Y onto that space, defined thus by


$$z^T (Y - \hat{Y}) = z^T (Y - z\hat{\beta}) = 0 \qquad (1.20)$$


and that the residual sum of squares is the squared norm of the component of
Y orthogonal to the columns of z. See Figure 1.1.
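
The least squares equations and the orthogonality property can be checked on a small case. The sketch below solves the estimating equations by Gaussian elimination using only the standard library; the helper names and the data are hypothetical, and in practice a numerical linear algebra library would normally be used instead.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a x = b for square a by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[pivot] = m[pivot], m[i]
        for r in range(i + 1, n):
            factor = m[r][i] / m[i][i]
            for col in range(i, n + 1):
                m[r][col] -= factor * m[i][col]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(z, y):
    """Return the estimate, the fitted values and the residual sum of squares."""
    zt = transpose(z)
    ztz = matmul(zt, z)
    zty = [sum(zc * yk for zc, yk in zip(col, y)) for col in zt]
    beta_hat = solve(ztz, zty)
    fitted = [sum(zk * b for zk, b in zip(row, beta_hat)) for row in z]
    rss = sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))
    return beta_hat, fitted, rss


# Simple linear regression as a special case: a column of 1s and a column of z values.
z = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 1.9, 3.2, 3.8]
beta_hat, fitted, rss = least_squares(z, y)
residual = [yk - fk for yk, fk in zip(y, fitted)]
print(beta_hat, rss)
# Each column of z is orthogonal to the residual vector, up to rounding error.
print([sum(c * r for c, r in zip(col, residual)) for col in transpose(z)])
```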

Figure 1.1. Linear model. Without random error the vector Y would lie in the
shaded space spanned by the columns $z_1, z_2, \ldots$ of the matrix z. The least squares
estimate $\hat{\beta}$ is defined by orthogonal projection of Y onto the space determined by z.
For orthogonality the vector $Y - z\hat{\beta}$ must be orthogonal to the vectors $z_1, z_2, \ldots$.
Further a Pythagorean identity holds for the squared length $Y^T Y$ decomposing it
into the residual sum of squares, RSS, and the squared length of the vector of fitted
values.

There is a fairly direct generalization of these results to the nonlinear regression
model of Example 1.5. Here if there were no error the observations would
lie on the surface defined by the vector $\mu(\beta)$ as $\beta$ varies. Orthogonal projection
involves finding the point $\mu(\hat{\beta})$ closest to Y in the least squares sense, i.e., minimizing
the sum of squares of deviations $\{Y - \mu(\beta)\}^T \{Y - \mu(\beta)\}$. The resulting
equations defining $\hat{\beta}$ are best expressed by defining


$$z^T(\beta) = \nabla \mu^T(\beta), \qquad (1.21)$$


where $\nabla$ is the $q \times 1$ gradient operator with respect to $\beta$, i.e., $\nabla^T =
(\partial/\partial\beta_1, \ldots, \partial/\partial\beta_q)$. Thus $z(\beta)$ is an $n \times q$ matrix, reducing to the previous z
in the linear case. Just as the columns of z define the linear model, the columns
of $z(\hat{\beta})$ define the tangent space to the model surface evaluated at $\hat{\beta}$. The least
squares estimating equation is thus


$$z^T(\hat{\beta})\{Y - \mu(\hat{\beta})\} = 0. \qquad (1.22)$$


The local linearization implicit in this is valuable for numerical iteration.
One of the simplest special cases arises when $E(Y_k) = \beta_0 \exp(-\beta_1 z_k)$ and the
geometry underlying the nonlinear least squares equations is summarized in
Figure 1.2.
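
One way of exploiting the local linearization is the Gauss-Newton iteration, sketched below for the special case of an exponential decay mean. The choice of Gauss-Newton is an assumption of this sketch, as the text does not name a particular scheme; the function names, starting values and data are also hypothetical. At each step the linear least squares equations are solved with the current tangent matrix in place of z.

```python
import math


def mean_vector(beta, z):
    """mu_k(beta) = beta0 * exp(-beta1 * z_k)."""
    b0, b1 = beta
    return [b0 * math.exp(-b1 * zk) for zk in z]


def tangent_matrix(beta, z):
    """n x 2 matrix of derivatives of mu_k with respect to (beta0, beta1)."""
    b0, b1 = beta
    return [[math.exp(-b1 * zk), -b0 * zk * math.exp(-b1 * zk)] for zk in z]


def gauss_newton(y, z, beta_start, iterations=50, tol=1e-12):
    """Iterate beta <- beta + (J^T J)^{-1} J^T {y - mu(beta)}; no safeguards."""
    beta = list(beta_start)
    for _ in range(iterations):
        resid = [yk - mk for yk, mk in zip(y, mean_vector(beta, z))]
        jac = tangent_matrix(beta, z)
        # Elements of the 2 x 2 matrix J^T J and of the vector J^T resid.
        a = sum(r[0] * r[0] for r in jac)
        b = sum(r[0] * r[1] for r in jac)
        d = sum(r[1] * r[1] for r in jac)
        g0 = sum(r[0] * e for r, e in zip(jac, resid))
        g1 = sum(r[1] * e for r, e in zip(jac, resid))
        det = a * d - b * b
        step = [(d * g0 - b * g1) / det, (a * g1 - b * g0) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Illustrative data lying close to 2 * exp(-0.5 * z).
z = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.05, 1.18, 0.76, 0.43, 0.29]
print(gauss_newton(y, z, beta_start=[1.5, 0.3]))
```

The iteration has no step-size control, so it can fail from poor starting values; it is meant only to show how the tangent matrix enters the computation.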

The simple examples used here in illustration have one component random 
variable attached to each observation and all random variables are mutually 
independent. In many situations random variation comes from several sources 
and random components attached to different component observations may not 
be independent, showing for example temporal or spatial dependence. 


Figure 1.2. Nonlinear least squares estimation. Without random error the
observations would lie on the curved surface shown here for just two observations.
The least squares estimates are obtained by projection such that the residual
vector is orthogonal to the tangent plane to the surface at the fitted point.


Example 1.9. A component of variance model. The simplest model with two
sources of random variability is the normal-theory component of variance
formulation, which for random variables $Y_{ks}$; $k = 1, \ldots, m$; $s = 1, \ldots, t$ has
the form


$$Y_{ks} = \mu + \eta_k + \epsilon_{ks}. \tag{1.23}$$


Here $\mu$ is an unknown mean and the $\eta$ and the $\epsilon$ are mutually independent
normally distributed random variables with zero mean and variances respectively
$\tau_\eta$, $\tau_\epsilon$, called components of variance. This model represents the simplest form
for the resolution of random variability into two components. The model could
be specified by the simple block-diagonal form of the covariance matrix of $Y$
considered as a single column vector.
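
The block-diagonal covariance structure can be made concrete with a short sketch. It assumes that the observations are stacked group by group into a single vector; the function name and argument names are illustrative only.

```python
def component_covariance(m, t, tau_eta, tau_eps):
    """Covariance matrix of (Y_11, ..., Y_1t, ..., Y_m1, ..., Y_mt) under (1.23)."""
    n = m * t
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same_group = (i // t) == (j // t)
            if i == j:
                # var(Y_ks) = tau_eta + tau_eps
                cov[i][j] = tau_eta + tau_eps
            elif same_group:
                # Observations sharing eta_k have covariance tau_eta.
                cov[i][j] = tau_eta
    return cov
```

Observations in different groups share no random component, so all entries outside the $m$ diagonal blocks of size $t \times t$ are zero.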

Models of this broad type with several layers of variation are sometimes 
called Bayesian, a misleading terminology that will not be used here. 


Example 1.10. Markov models. For any sequence of random variables
$Y_1, \ldots, Y_n$ the joint density can be factorized recursively in the form


$$f_{Y_1}(y_1)\, f_{Y_2|Y_1}(y_2; y_1) \cdots f_{Y_n|Y_1,\ldots,Y_{n-1}}(y_n; y_1, \ldots, y_{n-1}). \tag{1.24}$$


If the process is a Markov process, in which very often the sequence is in time,
there is the major simplification that in each term the conditioning is only on
the preceding term, so that the density is


$$f_{Y_1}(y_1) \prod_{k=2}^{n} f_{Y_k|Y_{k-1}}(y_k; y_{k-1}). \tag{1.25}$$


That is, to produce a parametric Markov process we have to specify only the
one-step transition probabilities in parametric form.

Another commonly occurring more complicated form of variation arises with 
time series, spatial and spatial-temporal data. The simplest time series model 
is the Gaussian first-order autoregression, a Markov process defined by the
equation


$$Y_t - \mu = \rho(Y_{t-1} - \mu) + \epsilon_t. \tag{1.26}$$


Here the forcing terms, $\epsilon_t$, are called innovations. They are assumed to be
independent and identically distributed random variables of zero mean and
variance $\sigma_\epsilon^2$. The specification is completed once the value or distribution of
the initial condition $Y_1$ is set out.
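
The Markov factorization (1.25) gives the log likelihood of the autoregression (1.26) as a sum of one-step terms. The following is a minimal sketch assuming normally distributed innovations and treating the first observation as a fixed initial condition, so that only the transition densities contribute; the names `rho` and `sigma2` for the autoregressive coefficient and the innovation variance are illustrative.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def ar1_loglik(y, mu, rho, sigma2):
    """Log likelihood of (1.26), conditional on the first observation."""
    total = 0.0
    for prev, curr in zip(y, y[1:]):
        # Given the preceding value, Y_t is normal with mean mu + rho (prev - mu).
        total += normal_logpdf(curr, mu + rho * (prev - mu), sigma2)
    return total
```

If instead a distribution is specified for the initial condition, its log density at the first observation would be added to the total.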

Extensions of this idea to spatial and spatial-temporal data are important; see 
Note 1.6. 


1.7 Parameters 


A central role is played throughout the book by the notion of a parameter
vector, $\theta$. Initially this serves to index the different probability distributions
making up the full model. If interest were exclusively in these probability
distributions as such, any (1, 1) transformation of $\theta$ would serve equally well and
the choice of a particular version would be essentially one of convenience. For
most of the applications in mind here, however, the interpretation is via specific
parameters and this raises the need both to separate parameters of interest, $\psi$,
from nuisance parameters, $\lambda$, and to choose specific representations. In relatively
complicated problems where several different research questions are under
study different parameterizations may be needed for different purposes.

There are a number of criteria that may be used to define the individual 
component parameters. These include the following: 


• the components should have clear subject-matter interpretations, for
example as differences, rates of change or as properties such as in a
physical context mass, energy and so on. If not dimensionless they should
be measured on a scale unlikely to produce very large or very small values;

• it is desirable that this interpretation is retained under reasonable
perturbations of the model;

• different components should not have highly correlated errors of estimation;

• statistical theory for estimation should be simple;

• if iterative methods of computation are needed then speedy and assured
convergence is desirable.


The first criterion is of primary importance for parameters of interest, at 
least in the presentation of conclusions, but for nuisance parameters the other 
criteria are of main interest. There are considerable advantages in formulations 
leading to simple methods of analysis and judicious simplicity is a powerful aid
to understanding, but for parameters of interest subject-matter meaning must
have priority.


Notes 1 


General. There are quite a number of books setting out in some detail the 
mathematical details for this chapter and for much of the subsequent material. 
Davison (2003) provides an extended and wide-ranging account of statistical 
methods and their theory. Good introductory accounts from a more theoretical 
stance are Rice (1988) and, at a somewhat more advanced level, Casella and 
Berger (1990). For an introduction to Bayesian methods, see Box and Tiao 
(1973). Rao (1973) gives a wide-ranging account of statistical theory together 
with some of the necessary mathematical background. Azzalini (1996), Pawitan 
(2000) and Severini (2001) emphasize the role of likelihood, as forms the basis 
of the present book. Silvey (1970) gives a compact and elegant account of 
statistical theory taking a different viewpoint from that adopted here. Young 
and Smith (2005) give a broad introduction to the theory of statistical infer- 
ence. The account of the theory of statistics by Cox and Hinkley (1974) is 
more detailed than that in the present book. Barnett and Barnett (1999) give a 
broad comparative view. Williams (2001) provides an original and wide-ranging 
introduction to probability and to statistical theory. 


Section 1.1. The sequence question—data—analysis is emphasized here. In data 
mining the sequence may be closer to data—analysis—question. Virtually the 
same statistical methods may turn out to be useful, but the probabilistic 
interpretation needs considerable additional caution in the latter case. 

More formally (1.2) can be written


$$E(Y) = \int y f_Y(y; \theta)\, \nu(dy),$$


where $\nu(\cdot)$ is a dominating measure, typically to be thought of as $dy$ in regions
of continuity and having atoms of unit size where $Y$ has a positive probability.
The support of a distribution is loosely the set of points with nonzero density;
more formally every open set containing a point of support should have positive
probability. In most but not all problems the support does not depend on $\theta$.

A great deal can be achieved from the properties of $E$ and the associated
notions of variance and covariance, i.e., from first and second moment
calculations. In particular, subject only to existence the expected value of a sum
is the sum of the expected values and for independent random variables the
expectation of a product is the product of the expectations. There is also a valuable
property of conditional expectations summarized in $E(Y) = E_X E(Y \mid X)$.
An important extension obtained by applying this last property also to $Y^2$ is
that $\mathrm{var}(Y) = E_X\{\mathrm{var}(Y \mid X)\} + \mathrm{var}_X\{E(Y \mid X)\}$. That is, an unconditional
variance is the expectation of a conditional variance plus the variance of a
conditional mean.
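
The variance decomposition can be checked exactly on a small discrete example. The sketch below assumes a joint distribution supplied as a dictionary of probabilities; the function name and the use of exact fractions are illustrative choices.

```python
from fractions import Fraction as F

def total_variance_parts(joint):
    """joint maps (x, y) to P(X = x, Y = y).

    Returns (var(Y), E_X{var(Y | X)}, var_X{E(Y | X)}).
    """
    ey = sum(p * y for (x, y), p in joint.items())
    var_y = sum(p * (y - ey) ** 2 for (x, y), p in joint.items())
    # Marginal distribution of X.
    px = {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
    # Conditional mean and variance of Y for each value of X.
    cond_mean = {x: sum(p * y for (x2, y), p in joint.items() if x2 == x) / px[x]
                 for x in px}
    cond_var = {x: sum(p * (y - cond_mean[x]) ** 2
                       for (x2, y), p in joint.items() if x2 == x) / px[x]
                for x in px}
    mean_of_var = sum(px[x] * cond_var[x] for x in px)
    var_of_mean = sum(px[x] * (cond_mean[x] - ey) ** 2 for x in px)
    return var_y, mean_of_var, var_of_mean
```

For any joint distribution the first returned value equals the sum of the other two.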

The random variables considered all take values in some finite-dimensional 
real space. It is possible to consider more general possibilities, such as 
complex-valued random variables or random variables in some space of infin- 
ite dimension, such as Hilbert space, but this will not be necessary here. The 
so-called singular continuous distributions of probability theory arise only as 
absolutely continuous distributions on some manifold of lower dimension than 
the space originally considered. In general discussions of probability, the density 
for absolutely continuous random variables is not uniquely defined, only integ- 
rals over sets being meaningful. This nonuniqueness is of no concern here, partly 
because all real distributions are discrete and partly because there is nothing to 
be gained by considering versions of the density other than those well behaved 
for all members of the family of distributions under study. 


Section 1.3. The examples in this section are mostly those arising in accounts 
of elementary statistical methods. All of them can be generalized in various 
ways. The assumptions of Example 1.3 form the basis of the discussion of least 
squares theory by Gauss leading to the notion of linear unbiased estimates of 
minimum variance. 


Section 1.4. It may seem odd to exclude point estimation from the list of topics 
but it is most naturally regarded either as a decision problem, involving therefore 
no explicit statement of uncertainty, or as an intermediate stage of analysis of 
relatively complex data. 


Section 1.5. The moment generating function $M_Y(p)$ for a random variable $Y$,
equivalent except for sign to a Laplace transform of the density, is $E(e^{pY})$ or
in the vector case $E(e^{p^T Y})$. For a normal distribution of mean $\mu$ and variance
$\sigma^2$ it is $\exp(p\mu + p^2\sigma^2/2)$. For the sum of independent random variables,
$S = \sum Y_k$, the product law of expectations gives $M_S(p) = \exp(\sum \mu_k p + \sum \sigma_k^2 p^2/2)$.
A uniqueness theorem then proves that $S$ and hence $\bar{Y}$ has a normal distribution.

Equation (1.12) is the version for densities of the immediate consequence of
the definition of conditional probability. For events $A$ and $B$ this takes the form


$$P(B \mid A) = P(A \cap B)/P(A) = P(A \mid B)P(B)/P(A).$$


The details of the calculation for (1.13), (1.14) and (1.15) have been omitted.
Note that $\sum (y_k - \mu)^2 = n(\bar{y} - \mu)^2 + \sum (y_k - \bar{y})^2$, in which the second term
can be ignored because it does not involve $\mu$, and that finally as a function of
$\mu$ we can write $A\mu^2 + 2B\mu + C$ as $A(\mu + B/A)^2$ plus a constant independent
of $\mu$.


Section 1.6. Most of the general books mentioned above discuss the normal- 
theory linear model as do the numerous textbooks on regression analysis. For a 
systematic study of components of variance, see, for instance, Cox and Solomon 
(2002) and for time series analysis Brockwell and Davis (1991, 1998). For 
spatial processes, see Besag (1974) and Besag and Mondal (2005). 


Section 1.7. For a discussion of choice of parameterization especially in 
nonlinear models, see Ross (1990). 


2 


Some concepts and simple applications 


Summary. An introduction is given to the notions of likelihood and sufficiency 
and the exponential family of distributions is defined and exemplified. 


2.1 Likelihood 


The likelihood for the vector of observations $y$ is defined as

$$\mathrm{lik}(\theta; y) = f_Y(y; \theta), \tag{2.1}$$


considered in the first place as a function of $\theta$ for given $y$. Mostly we work
with its logarithm $l(\theta; y)$, often abbreviated to $l(\theta)$. Sometimes this is treated
as a function of the random vector $Y$ rather than of $y$. The log form is
convenient, in particular because $f$ will often be a product of component terms.
Occasionally we work directly with the likelihood function itself. For nearly 
all purposes multiplying the likelihood formally by an arbitrary function of y, 
or equivalently adding an arbitrary such function to the log likelihood, would 
leave unchanged that part of the analysis hinging on direct calculations with the 
likelihood. 

Any calculation of a posterior density, whatever the prior distribution, uses 
the data only via the likelihood. Beyond that, there is some intuitive appeal in 
the idea that differences in $l(\theta)$ measure the relative effectiveness of different
parameter values $\theta$ in explaining the data. This is sometimes elevated into a
principle called the law of the likelihood. 

A key issue concerns the additional arguments needed to extract useful 
information from the likelihood, especially in relatively complicated problems 
possibly with many nuisance parameters. Likelihood will play a central role in 
almost all the arguments that follow.


2.2 Sufficiency


2.2.1 Definition


The term statistic is often used (rather oddly) to mean any function of the
observed random variable $Y$ or its observed counterpart. A statistic $S = S(Y)$
is called sufficient under the model if the conditional distribution of $Y$ given
$S = s$ is independent of $\theta$ for all $s$, $\theta$. Equivalently


$$l(\theta; y) = \log h(s, \theta) + \log m(y), \tag{2.2}$$


for suitable functions $h$ and $m$. The equivalence forms what is called the Neyman
factorization theorem. The proof in the discrete case follows most explicitly by
defining any new variable $W$, a function of $Y$, such that $Y$ is in (1,1)
correspondence with $(S, W)$, i.e., such that $(S, W)$ determines $Y$. The individual atoms
of probability are unchanged by transformation. That is,


$$f_Y(y; \theta) = f_{S,W}(s, w; \theta) = f_S(s; \theta)\, f_{W|S}(w; s), \tag{2.3}$$


where the last term is independent of $\theta$ by definition. In the continuous case
there is the minor modification that a Jacobian, not involving $\theta$, is needed when
transforming from $Y$ to $(S, W)$. See Note 2.2.

We use the minimal form of S; i.e., extra components could always be added 
to any given S and the sufficiency property retained. Such addition is undesirable 
and is excluded by the requirement of minimality. The minimal form always 
exists and is essentially unique. 

Any Bayesian inference uses the data only via the minimal sufficient statistic. 
This is because the calculation of the posterior distribution involves multiplying 
the likelihood by the prior and normalizing. Any factor of the likelihood that is 
a function of y alone will disappear after normalization. 

In a broader context the importance of sufficiency can be considered to arise 
as follows. Suppose that instead of observing Y = y we were equivalently to 
be given the data in two stages: 


• first we observe $S = s$, an observation from the density $f_S(s; \theta)$;
• then we are given the remaining data, in effect an observation from the
density $f_{Y|S}(y; s)$.


Now, so long as the model holds, the second stage is an observation on a
fixed and known distribution which could as well have been obtained from a
random number generator. Therefore $S = s$ contains all the information about
$\theta$ given the model, whereas the conditional distribution of $Y$ given $S = s$ allows
assessment of the model.

There are thus two implications of sufficiency. One is that given the model and
disregarding considerations of robustness or computational simplicity the data
should be used only via $s$. The second is that the known distribution $f_{Y|S}(y \mid s)$
is available to examine the adequacy of the model. Illustrations will be given in
Chapter 3.

Sufficiency may reduce the dimensionality of a problem but we still have to 
determine what to do with the sufficient statistic once found. 


2.2.2 Simple examples


Example 2.1. Exponential distribution (ctd). The likelihood for Example 1.6 is

$$\rho^n \exp(-\rho \sum y_k), \tag{2.4}$$

so that the log likelihood is

$$n \log \rho - \rho \sum y_k, \tag{2.5}$$


and, assuming $n$ to be fixed, involves the data only via $\sum y_k$ or equivalently
via $\bar{y} = \sum y_k/n$. By the factorization theorem the sum (or mean) is therefore
sufficient. Note that had the sample size also been random the sufficient statistic
would have been $(n, \sum y_k)$; see Example 2.4 for further discussion.

In this example the density of $S = \sum Y_k$ is $\rho(\rho s)^{n-1} e^{-\rho s}/(n - 1)!$, a gamma
distribution. It follows that $\rho S$ has a fixed distribution. It follows also that the
joint conditional density of the $Y_k$ given $S = s$ is uniform over the simplex
$0 \le y_k \le s$; $\sum y_k = s$. This can be used to test the adequacy of the model.
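
That the log likelihood (2.5) involves the data only through the sample size and the sum can be seen in a small sketch; the function name is illustrative. Two samples of the same size and the same total give identical log likelihood functions, however different the individual values.

```python
import math

def exp_loglik(rho, ys):
    """Log likelihood (2.5) for independent exponential observations with rate rho."""
    return len(ys) * math.log(rho) - rho * sum(ys)
```

Any difference between two such samples is therefore relevant only for assessing the model, not for inference about the rate within the model.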


Example 2.2. Linear model (ctd). A minimal sufficient statistic for the linear 
model, Example 1.4, consists of the least squares estimates and the residual 
sum of squares. This strong justification of the use of least squares estimates 
depends on writing the log likelihood in the form 


$$-n \log \sigma - (y - z\beta)^T (y - z\beta)/(2\sigma^2) \tag{2.6}$$


and then noting that

$$(y - z\beta)^T (y - z\beta) = (y - z\hat\beta)^T (y - z\hat\beta) + (\hat\beta - \beta)^T (z^T z)(\hat\beta - \beta), \tag{2.7}$$

in virtue of the equations defining the least squares estimates. This last identity
has a direct geometrical interpretation. The squared norm of the vector defined
by the difference between $Y$ and its expected value $z\beta$ is decomposed into a
component defined by the difference between $Y$ and the estimated mean $z\hat\beta$ and
an orthogonal component defined via $\hat\beta - \beta$. See Figure 1.1.

It follows that the log likelihood involves the data only via the least squares
estimates and the residual sum of squares. Moreover, if the variance $\sigma^2$ were
known, the residual sum of squares would be a constant term in the log likelihood
and hence the sufficient statistic would be reduced to $\hat\beta$ alone.
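
The decomposition (2.7) can be verified numerically. The sketch below assumes a design matrix with just two columns so that the normal equations can be solved in closed form without a linear algebra library; the function names are illustrative.

```python
def rss(z, y, beta):
    """Sum of squares (y - z beta)^T (y - z beta)."""
    return sum((yk - sum(zj * bj for zj, bj in zip(row, beta))) ** 2
               for row, yk in zip(z, y))

def least_squares_2col(z, y):
    """Solve the 2 x 2 normal equations (z^T z) beta = z^T y."""
    a00 = sum(r[0] * r[0] for r in z)
    a01 = sum(r[0] * r[1] for r in z)
    a11 = sum(r[1] * r[1] for r in z)
    g0 = sum(r[0] * yk for r, yk in zip(z, y))
    g1 = sum(r[1] * yk for r, yk in zip(z, y))
    det = a00 * a11 - a01 ** 2
    beta_hat = ((a11 * g0 - a01 * g1) / det, (a00 * g1 - a01 * g0) / det)
    return beta_hat, (a00, a01, a11)

def decomposition_gap(z, y, beta):
    """Left side of (2.7) minus right side; zero up to rounding error."""
    beta_hat, (a00, a01, a11) = least_squares_2col(z, y)
    d0, d1 = beta_hat[0] - beta[0], beta_hat[1] - beta[1]
    quad = a00 * d0 ** 2 + 2 * a01 * d0 * d1 + a11 * d1 ** 2
    return rss(z, y, beta) - (rss(z, y, beta_hat) + quad)
```

The gap vanishes for every value of $\beta$, which is why the likelihood depends on the data only through $\hat\beta$ and the residual sum of squares.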

This argument fails for a regression model nonlinear in the parameters, such 
as the exponential regression (1.5). In the absence of error the n x 1 vector of 
observations then lies on a curved surface and while the least squares estimates 
are still given by orthogonal projection they satisfy nonlinear equations and 
the decomposition of the log likelihood which is the basis of the argument 
for sufficiency holds only as an approximation obtained by treating the curved 
surface as locally flat. 


Example 2.3. Uniform distribution. Suppose that $Y_1, \ldots, Y_n$ are
independent and identically distributed in a uniform distribution on $(\theta_1, \theta_2)$. Let
$Y_{(1)}, Y_{(n)}$ be the smallest and largest order statistics, i.e., the random variables
$\min(Y_1, \ldots, Y_n)$ and $\max(Y_1, \ldots, Y_n)$ respectively. Then the likelihood involves
the data only via $(y_{(1)}, y_{(n)})$, being zero unless the interval $(y_{(1)}, y_{(n)})$ is within
$(\theta_1, \theta_2)$. Therefore the two order statistics are minimal sufficient. Similar results
hold when one terminus is known or when the range of the distribution, i.e.,
$\theta_2 - \theta_1$, is known.


In general, care is needed in writing down the likelihood and in evaluating the 
resulting procedures whenever the form of the likelihood has discontinuities, 
in particular, as here, where the support of the probability distribution depends 
on the unknown parameter. 
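
The discontinuous form of the likelihood in Example 2.3 can be written out directly. This is a minimal sketch; the function name is illustrative and the treatment of the end-points as included is a convention that does not affect the argument.

```python
def uniform_likelihood(theta1, theta2, ys):
    """Likelihood for independent observations uniform on (theta1, theta2)."""
    if theta1 <= min(ys) and max(ys) <= theta2:
        # Each observation contributes density 1 / (theta2 - theta1).
        return (theta2 - theta1) ** (-len(ys))
    # The parameter values are incompatible with the observed extremes.
    return 0.0
```

The data enter only through the sample size and the two extreme order statistics, and the likelihood jumps to zero as soon as the interval fails to cover them.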


Example 2.4. Binary fission. Suppose that in a system of particles each particle
has in small time period $(t, t + \Delta)$ a probability $\rho\Delta + o(\Delta)$ of splitting into
two particles with identical properties and a probability $1 - \rho\Delta + o(\Delta)$ of
remaining unchanged, events for different particles being independent. Suppose
that the system is observed for a time period $(0, t_0)$. Then by dividing the whole
exposure history into small intervals separately for each individual we have that
the likelihood is


$$\rho^n e^{-\rho t_e}. \tag{2.8}$$


Here $n$ is the number of new particles born and $t_e$ is the total exposure time,
i.e., the sum of all life-times of individuals in the system, and both are random
variables, so that the sufficient statistic is $S = (N, T_e)$.


2.3 Exponential family 


We now consider a family of models that are important both because they have 
relatively simple properties and because they capture in one discussion many,
although by no means all, of the inferential properties connected with important
standard distributions, the binomial, Poisson and geometric distributions and
the normal and gamma distributions, and others too.

Suppose that $\theta$ takes values in a well-behaved region in $R^d$, and not of
lower dimension, and that we can find a $(d \times 1)$-dimensional statistic $s$ and
a parameterization $\phi$, i.e., a (1,1) transformation of $\theta$, such that the model has
the form


$$m(y) \exp\{s^T \phi - k(\phi)\}, \tag{2.9}$$


where $s = s(y)$ is a function of the data. Then $S$ is sufficient; subject to some
important regularity conditions this is called a regular or full $(d, d)$ exponential
family of distributions. The statistic $S$ is called the canonical statistic and $\phi$ the
canonical parameter. The parameter $\eta = E(S; \phi)$ is called the mean parameter.
Because the stated function defines a distribution, it follows that


$$\int m(y) \exp\{s^T \phi - k(\phi)\}\, dy = 1, \tag{2.10}$$

$$\int m(y) \exp\{s^T (\phi + p) - k(\phi + p)\}\, dy = 1. \tag{2.11}$$

Hence, noting that $s^T p = p^T s$, we have from (2.11) that the moment
generating function of $S$ is

$$E\{\exp(p^T S)\} = \exp\{k(\phi + p) - k(\phi)\}. \tag{2.12}$$


Therefore the cumulant generating function of $S$, defined as the log of the
moment generating function, is


$$k(\phi + p) - k(\phi), \tag{2.13}$$


providing a fairly direct interpretation of the function $k(\cdot)$. Because the mean
is given by the first derivatives of that generating function, we have that
$\eta = \nabla k(\phi)$, where $\nabla$ is the gradient operator $(\partial/\partial\phi_1, \ldots, \partial/\partial\phi_d)^T$. See Note 2.3
for a brief account of both cumulant generating functions and the $\nabla$ notation.


Example 2.5. Binomial distribution. If $R$ denotes the number of successes in
$n$ independent binary trials each with probability of success $\pi$, its density can
be written


$$n!/\{r!(n - r)!\}\, \pi^r (1 - \pi)^{n-r} = m(r) \exp\{r\phi - n \log(1 + e^\phi)\}, \tag{2.14}$$


say, where $\phi = \log\{\pi/(1 - \pi)\}$, often called the log odds, is the canonical
parameter and $r$ the canonical statistic. Note that the mean parameter is
$E(R) = n\pi$ and can be recovered also by differentiating $k(\phi)$.

In general for continuous distributions a fixed smooth (1,1) transformation of
the random variable preserves the exponential family form; the Jacobian of the
transformation changes only the function $m(y)$. Thus if $Y$ has the exponential
distribution $\rho e^{-\rho y}$, or in exponential family form $\exp(-\rho y + \log \rho)$, then
$X = \sqrt{Y}$ has the Rayleigh distribution $2\rho x \exp(-\rho x^2)$, or in exponential family form
$2x \exp(-\rho x^2 + \log \rho)$. The effect of the transformation from $Y$ to $\sqrt{Y}$ is thus to
change the initializing function $m(\cdot)$ but otherwise leave the canonical statistic
taking the same value and leaving the canonical parameter and the function $k(\cdot)$
unchanged.

In general the likelihood as a function of $\phi$ is maximized by solving the
equation in $\hat\phi$


$$s = \nabla k(\phi)|_{\phi = \hat\phi} = \hat\eta, \tag{2.15}$$


where it can be shown that for a regular exponential family the solution is
unique and indeed corresponds to a maximum, unless possibly if $s$ lies on the
boundary of its support. We call $\hat\phi$ and $\hat\eta$ the maximum likelihood estimates of $\phi$
and $\eta$ respectively. In (2.15) we equate the canonical statistic to its expectation
to form an estimating equation which has a direct appeal, quite apart from more 
formal issues connected with likelihood. The important role of the maximum 
likelihood estimate will appear later. 
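
For the binomial family of Example 2.5 both the relation between the mean parameter and the derivative of $k$ and the estimating equation (2.15) can be illustrated in a few lines. The sketch assumes a Newton iteration started at zero, which is an illustrative choice rather than a method prescribed in the text; the function names are likewise illustrative.

```python
import math

def k(phi, n):
    """Function k(phi) = n log(1 + e^phi) for the binomial family (2.14)."""
    return n * math.log1p(math.exp(phi))

def k_prime(phi, n):
    """Mean parameter eta = dk/dphi = n * pi, with pi = e^phi / (1 + e^phi)."""
    return n / (1.0 + math.exp(-phi))

def k_double_prime(phi, n):
    """Second derivative n * pi * (1 - pi), the variance of the canonical statistic."""
    pi = 1.0 / (1.0 + math.exp(-phi))
    return n * pi * (1.0 - pi)

def solve_canonical_mle(r, n, phi=0.0, iterations=50):
    """Newton iteration for the estimating equation r = k'(phi)."""
    for _ in range(iterations):
        phi += (r - k_prime(phi, n)) / k_double_prime(phi, n)
    return phi
```

For $0 < r < n$ the solution is the observed log odds, $\log\{r/(n - r)\}$. When $r = 0$ or $r = n$ the canonical statistic is on the boundary of its support, no finite solution exists and the iteration should not be used.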

A further notion is that of a curved exponential family. Suppose that we start 
with a full exponential family in which the canonical parameter $\phi$ and canonical
statistic $s$ are both of dimension $k$. Now suppose that $\phi$ is restricted to lie in
a curved space of dimension $d$, where $d < k$. Then the exponential family
form will still hold but now the $k \times 1$ vector $\phi$ is a function of $d$ originating
parameters defining position in the curved space. This is called a (k, d) curved 
exponential family. The statistical properties of curved exponential families are 
much less simple than those of full exponential families. The normal-theory 
nonlinear regression already discussed is an important example. The following 
are simpler instances. 


Example 2.6. Fisher’s hyperbola. Suppose that the model is that the pairs
$(Y_{k1}, Y_{k2})$ are independently normally distributed with unit variance and with
means $(\theta, c/\theta)$, where $c$ is a known constant. The log likelihood for a single
observation is


$$y_{k1}\theta + c y_{k2}/\theta - \theta^2/2 - c^2\theta^{-2}/2. \tag{2.16}$$


Note that the squared terms in the normal density may be omitted because
they are not functions of the unknown parameter. It follows that for the full
set of observations the sample totals or means form the minimal sufficient
statistic. Thus what in the full family would be the two-dimensional parameter
$(\phi_1, \phi_2)$ is confined to the hyperbola $(\theta, c/\theta)$ forming a (2,1) family with a
two-dimensional sufficient statistic and a one-dimensional parameter.


Example 2.7. Binary fission (ctd). As shown in Example 2.4 the likelihood is
a function of $n$, the number of births, and $t_e$, the total exposure time. This forms
in general a (2, 1) exponential family. Note, however, that if we observed until
either one of $n$ and $t_e$ took a preassigned value, then it would reduce to a full,
i.e., (1, 1), family.


2.4 Choice of priors for exponential family problems 


While in Bayesian theory choice of prior is in principle not an issue of achieving 
mathematical simplicity, nevertheless there are gains in using reasonably simple 
and flexible forms. In particular, if the likelihood has the full exponential family 
form


$$m(y) \exp\{s^T \phi - k(\phi)\}, \tag{2.17}$$

a prior for $\phi$ proportional to

$$\exp\{s_0^T \phi - a_0 k(\phi)\} \tag{2.18}$$

leads to a posterior proportional to


$$\exp\{(s + s_0)^T \phi - (1 + a_0) k(\phi)\}. \tag{2.19}$$


Such a prior is called conjugate to the likelihood, or sometimes closed under
sampling. The posterior distribution has the same form as the prior with $s_0$
replaced by $s + s_0$ and $a_0$ replaced by $1 + a_0$.
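
For the binomial family of Example 2.5 the conjugate updating amounts to adding counts. The sketch below assumes the prior is expressed as a density for the success probability proportional to $\pi^{r_0}(1 - \pi)^{n_0 - r_0}$, that is as if $r_0$ successes had been seen in $n_0$ earlier trials; the function names and this way of labelling the prior are illustrative.

```python
def beta_binomial_update(r0, n0, r, n):
    """Combine prior 'fictitious data' (r0, n0) with observed data (r, n)."""
    return r0 + r, n0 + n

def beta_mean(successes, trials):
    """Mean of the density proportional to pi^successes (1 - pi)^(trials - successes)."""
    # This density is a beta distribution with parameters
    # successes + 1 and trials - successes + 1.
    return (successes + 1) / (trials + 2)
```

For instance a prior equivalent to one success in two trials combined with seven successes in ten trials gives a posterior of the same form with eight successes in twelve trials.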


Example 2.8. Binomial distribution (ctd). This continues the discussion of
Example 2.5. If the prior for $\pi$ is proportional to


$$\pi^{r_0} (1 - \pi)^{n_0 - r_0}, \tag{2.20}$$


i.e., is a beta distribution, then the posterior is another beta distribution
corresponding to $r + r_0$ successes in $n + n_0$ trials. Thus both prior and posterior are
beta distributions. It may help to think of $(r_0, n_0)$ as fictitious data! If the prior
information corresponds to fairly small values of $n_0$ its effect on the conclusions
will be small if the amount of real data is appreciable.


2.5 Simple frequentist discussion


In Bayesian approaches sufficiency arises as a convenient simplification of the 
likelihood; whatever the prior the posterior is formed from the likelihood and 
hence depends on the data only via the sufficient statistic. 

In frequentist approaches the issue is more complicated. Faced with a new 
model as the basis for analysis, we look for a Fisherian reduction, defined as 
follows: 


• find the likelihood function;

• reduce to a sufficient statistic $S$ of the same dimension as $\theta$;

• find a function of $S$ that has a distribution depending only on $\psi$;

• invert that distribution to obtain limits for $\psi$ at an arbitrary set of
probability levels;

• use the conditional distribution of the data given $S = s$ informally or
formally to assess the adequacy of the formulation.


Immediate application is largely confined to regular exponential family models. 

While most of our discussion will centre on inference about the parameter
of interest, $\psi$, the complementary role of sufficiency of providing an explicit
base for model checking is in principle very important. It recognizes that our 
formulations are always to some extent provisional, and usually capable to some 
extent of empirical check; the universe of discussion is not closed. In general 
there is no specification of what to do if the initial formulation is inadequate but, 
while that might sometimes be clear in broad outline, it seems both in practice 
and in principle unwise to expect such a specification to be set out in detail in 
each application. 

The next phase of the analysis is to determine how to use $s$ to answer more
focused questions, for example about the parameter of interest $\psi$. The simplest
possibility is that there are no nuisance parameters, just a single parameter $\psi$
of interest, and reduction to a single component $s$ occurs. We then have one
observation on a known density $f_S(s; \psi)$ and distribution function $F_S(s; \psi)$.
Subject to some monotonicity conditions which, in applications, are typically
satisfied, the probability statement


$$P(S \le a_c(\psi)) = F_S(a_c(\psi); \psi) = 1 - c \tag{2.21}$$

can be inverted for continuous random variables into

$$P\{\psi \le b_c(S)\} = 1 - c. \tag{2.22}$$


Thus the statement on the basis of data $y$, yielding sufficient statistic $s$, that


$$\psi \le b_c(s) \tag{2.23}$$

provides an upper bound for $\psi$, that is a single member of a hypothetical long
run of statements a proportion $1 - c$ of which are true, generating a set of
statements in principle at all values of $c$ in $(0, 1)$.

This extends the discussion in Section 1.5, where $S$ is the mean $\bar{Y}$ of $n$
independent and identically normally distributed random variables of known 
variance and the parameter is the unknown mean. 

For discrete distributions there is the complication that the distribution func- 
tion is a step function so that the inversion operation involved is nonunique. 
This is an embarrassment typically only when very small amounts of data are 
involved. We return in Section 3.2 to discuss how best to handle this issue. 

An alternative avenue, following the important and influential work of 
J. Neyman and E. S. Pearson, in a sense proceeds in the reverse direction. 
Optimality criteria are set up for a formulation of a specific issue, expressing 
some notion of achieving the most sensitive analysis possible with the available 
data. The optimality problem is then solved (typically by appeal to a reservoir 
of relevant theorems) and a procedure produced which, in the contexts under 
discussion, is then found to involve the data only via s. The two avenues to s 
nearly, but not quite always, lead to the same destination. This second route 
will not be followed in the present book. 

In the Neyman—Pearson theory of tests, the sensitivity of a test is assessed 
by the notion of power, defined as the probability of reaching a preset level 
of significance considered for various alternative hypotheses. In the approach 
adopted here the assessment is via the distribution of the random variable P, 
again considered for various alternatives. The main applications of power are 
in the comparison of alternative procedures and in preliminary calculation of 
appropriate sample size in the design of studies. The latter application, which 
is inevitably very approximate, is almost always better done by considering the 
width of confidence intervals. 


2.6 Pivots 


The following definition is often helpful in dealing with individual applications 
including those with nuisance parameters. 

Suppose that $\psi$ is one-dimensional. Suppose that there is a statistic $t$, in the
present context a function of $s$ but possibly of more general form, and a function
$p(t, \psi)$ such that for all $\theta \in \Omega_\theta$ the random variable $p(T, \psi)$ has a fixed and
known continuous distribution and that for all $t$ the function $p(t, \psi)$ is strictly
increasing in $\psi$. Then $p$ is called a pivot for inference about $\psi$ and the fixed
distribution is called the pivotal distribution.

We have that for all $\psi$ and all $c$, $0 < c < 1$, we can find $p^*_c$ such that


$$P\{p(T, \psi) \le p^*_c\} = 1 - c, \tag{2.24}$$

implying that

$$P\{\psi \le q(T, c)\} = 1 - c. \tag{2.25}$$


We call $q(t, c)$ a $1 - c$ level upper limit for $\psi$ with obvious changes for
lower limits and intervals. The collection of limits for all $c$ encapsulates our
information about $\psi$ and the associated uncertainty. In many applications it
is convenient to summarize this by giving an upper $1 - c$ limit and a lower
$1 - c$ limit forming a $1 - 2c$ equi-tailed confidence interval, usually specifying
this for one or two conventionally chosen values of $c$. The use of equal tails is
essentially a presentational simplification. Clearly other than equal tails could 
be used if there were good reason for doing so. 

In the special discussion of Section 1.5 the pivot is $(\mu - \bar{Y})/(\sigma_0/\sqrt{n})$ and
the pivotal distribution is the standard normal distribution.
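
The passage from the normal-theory pivot to limits at any level can be sketched as follows, assuming a known standard deviation as in Section 1.5. The function and argument names are illustrative; `NormalDist` is the standard normal distribution from the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def upper_limit(ybar, sigma0, n, c):
    """1 - c level upper limit for mu from the pivot (mu - Ybar)/(sigma0/sqrt(n))."""
    # Upper c point of the standard normal pivotal distribution.
    k_c = NormalDist().inv_cdf(1 - c)
    return ybar + k_c * sigma0 / sqrt(n)

def equi_tailed_interval(ybar, sigma0, n, c):
    """1 - 2c interval formed from the lower and the upper 1 - c limits."""
    half_width = NormalDist().inv_cdf(1 - c) * sigma0 / sqrt(n)
    return ybar - half_width, ybar + half_width
```

Evaluating `upper_limit` over a range of values of `c` gives the whole collection of limits, of which the equi-tailed interval at a conventional level is only a convenient summary.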

Note that the frequentist limits of Section 1.5.2 correspond to the upper $c$
point of the posterior distribution of $\mu$ on taking the limit in (1.14) and (1.15)
as $nv/\sigma_0^2$ increases. Later, in Examples 5.5 and 5.6, the dangers of apparently
innocuous priors will be discussed. 

The use of pivots is primarily a presentational device. A key notion in the 
frequentist approach is that of sets of confidence intervals at various levels 
and the corresponding notion in Bayesian discussions is that of a posterior 
distribution. At least in relatively simple cases, these are most compactly
summarized by a pivot and its distribution. In particular, it may be possible to obtain
statistics $m_\psi$ and $s_\psi$ such that in the frequentist approach $(\psi - m_\psi)/s_\psi$ has
exactly, or to an adequate approximation, a standard normal distribution, or
possibly a Student $t$ distribution. Then confidence intervals at all levels are easily
recovered from just the two summarizing statistics. Note that from this
perspective these statistics are just a convenient convention for summarization; in
particular, it is best not to regard $m_\psi$ as a point estimate of $\psi$ to be considered in
isolation.

Example 1.1 can be generalized in many ways. If the standard deviation is
unknown and $\sigma_0$ is replaced by its standard estimate derived from the residual
sum of squares, the pivotal distribution becomes the Student $t$ distribution with
$n - 1$ degrees of freedom. Further, if we consider other than sufficient statistics,
the mean could be replaced by the median or by any other location estimate
with a corresponding modification of the pivotal distribution. The argument
applies directly to the estimation of any linear parameter in the linear model
of Example 1.4. In the second-order linear model generalizing Example 1.3,
the pivotal distribution is asymptotically standard normal by the Central Limit
Theorem and, if the variance is estimated, also by appeal to the Weak Law of
Large Numbers.

In the previous examples the parameter of interest has been a scalar. The
ideas extend to vector parameters of interest $\psi$, although in most applications
it is preferable to work with single interpretable components, often
representing contrasts or measures of dependence interpreted one at a time.


Example 2.9. Mean of a multivariate normal distribution. The most important
example of a vector parameter of interest concerns the vector mean $\mu$ of a
multivariate normal distribution. Suppose that $Y_1, \ldots, Y_n$ are independent and
identically distributed $p \times 1$ vectors having a multivariate normal distribution
of mean $\mu$ and known nonsingular covariance matrix $\Sigma_0$.

Evaluation of the likelihood shows that the sample mean is a minimal sufficient
statistic and that it has the same dimension as the unknown parameter. It
can be shown by linear transformation to a set of independent random components
that the quadratic form $n(\bar{Y} - \mu)^T \Sigma_0^{-1} (\bar{Y} - \mu)$ has a chi-squared distribution
with $p$ degrees of freedom and so forms a pivot. There is thus a set of concentric
similar ellipsoids defined by $\Sigma_0$ such that $\bar{Y}$ lies within each with specified
probability. Inversion of this yields confidence ellipsoids for $\mu$ centred on $\bar{y}$.

These regions are likelihood-based in the sense that any value of $\mu$ excluded
from a confidence region has lower likelihood than all parameter values $\mu$
included in that region. This distinguishes these regions from those of other
shapes, for example from rectangular boxes with edges parallel to the coordinate 
axes. The elliptical regions are also invariant under arbitrary nonsingular linear 
transformation of the vectors Y. 
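
The membership rule for such a confidence ellipsoid can be sketched for the special case $p = 2$, where the chi-squared distribution with 2 degrees of freedom has the closed-form distribution function $1 - e^{-x/2}$, so that no statistical library is needed. The function names and the numerical values below are hypothetical, and the restriction to two dimensions is an assumption made only to keep the example self-contained.

```python
from math import log


def chi2_2df_quantile(level):
    """Quantile of chi-squared with 2 degrees of freedom: solves 1 - exp(-x/2) = level."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    return -2.0 * log(1.0 - level)


def quad_form_2d(d, sigma):
    """Return d^T sigma^{-1} d for a 2-vector d and a symmetric 2 x 2 matrix sigma."""
    (a, b), (b2, e) = sigma
    det = a * e - b * b2
    if det <= 0.0 or a <= 0.0:
        raise ValueError("sigma must be a positive definite covariance matrix")
    # The inverse of [[a, b], [b2, e]] is [[e, -b], [-b2, a]] / det.
    return (e * d[0] ** 2 - (b + b2) * d[0] * d[1] + a * d[1] ** 2) / det


def in_confidence_region(mu, ybar, sigma0, n, c):
    """True if mu lies in the 1 - c confidence ellipse centred on ybar."""
    d = (ybar[0] - mu[0], ybar[1] - mu[1])
    pivot = n * quad_form_2d(d, sigma0)  # chi-squared with 2 d.f. under mu
    return pivot <= chi2_2df_quantile(1.0 - c)


# Hypothetical summary statistics, for illustration only.
sigma0 = [[2.0, 0.5], [0.5, 1.0]]
print(in_confidence_region((1.2, 2.1), (1.0, 2.0), sigma0, n=10, c=0.05))
```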

Quadratic forms associated in this way with the inverse of a covariance mat- 
rix occur frequently, especially in Chapter 6, and we note their geometrical 
interpretation. See Note 2.6. 


Notes 2 


Section 2.1. In some contexts the likelihood is better defined as the equivalence 
class of all functions obtained by multiplying by an arbitrary positive function 
of y. This would formalize the notion that it is really only the ratio of the 
likelihoods at two different parameter values that is inferentially meaningful. 
The importance of likelihood was emphasized quite often by R. A. Fisher, 
who also suggested that for some types of application it alone was appropriate;
see Fisher (1956). For accounts of statistics putting primary emphasis on
likelihood as such, see Edwards (1992) and Royall (1997) and, in a particular
time series context, Barnard et al. (1962). The expression law of the likelihood
was introduced by Hacking (1965). In Bayesian theory the likelihood captures 
the information directly in the data. 


Section 2.2. The Jacobian is the matrix of partial derivatives involved in any 
change of variables in a multiple integral and so, in particular, in any change of 
variables in a probability density; see any textbook on calculus. 

Sufficiency was introduced by Fisher (1922). A very mathematical treatment 
is due to Halmos and Savage (1949). The property arises in all approaches to 
statistical inference although its conceptual importance, heavily emphasized in 
the present book, depends greatly on the viewpoint taken. 


Section 2.3. The exponential family was first studied systematically in the 
1930s by G. Darmois, B. O. Koopman and E. J. G. Pitman after some combination
of whom it is sometimes named. Barndorff-Nielsen (1978) gives a
systematic account stressing the links with convex analysis. Essentially all 
well-behaved models with sufficient statistic having the same dimension as 
the parameter space are of this simple exponential family form. The general 
books listed in Notes 1 all give some account of the exponential family. 

The cumulant (or semi-invariant) generating function is the log of the moment
generating function, i.e., $\log E(e^{pY})$. If it can be expanded in a power series
$\sum \kappa_r p^r/r!$, the $\kappa_r$ are cumulants. The first cumulant is the mean, the higher
cumulants are functions of moments about the mean with $\kappa_2 = \mathrm{var}(Y)$. For a normal
distribution the cumulants of order 3 and higher are zero and for the multivariate
version $\kappa_{11} = \mathrm{cov}(Y_1, Y_2)$. The cumulant generating function of the sum of
independent random variables is the sum of the separate cumulant generating
functions with the corresponding property holding also for the cumulants.
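
As a worked instance, using the standard moment generating function of the Poisson distribution (a known result, not derived in this note), suppose that $Y$ has a Poisson distribution of mean $\mu$. Then $E(e^{pY}) = \exp\{\mu(e^p - 1)\}$ and the cumulant generating function expands as follows.

```latex
\log E(e^{pY}) = \mu (e^{p} - 1) = \sum_{r=1}^{\infty} \mu \, p^{r} / r! ,
\qquad \text{so that } \kappa_r = \mu \text{ for all } r \ge 1 .
```

In particular $\kappa_1 = \kappa_2$, i.e., the mean and variance coincide, which is the property examined by the dispersion index of Example 3.2.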

The notation $\nabla$ is used for the gradient operator, forming a column vector of
partial derivatives. It is intended here and especially in Chapter 6 to show
multiparameter arguments as closely parallel to the corresponding single-parameter
versions. It is necessary only to see that $\nabla\nabla^T$ forms a square matrix of second
partial derivatives and later that Taylor's theorem up to second-degree terms
can be written

$$g(x + h) = g(x) + \nabla^T g(x)\, h + h^T \nabla\nabla^T g(x)\, h/2,$$

where $g$ is a real-valued function of the multidimensional argument $x$.


Section 2.5. Definitive accounts of the Neyman–Pearson theory are given in
two books by E. L. Lehmann which have acquired coauthors as successive revisions
have been produced (Lehmann and Casella, 2001; Lehmann and Romano,
2004). The original papers are available in collected form (Neyman and Pearson,
1967).


Section 2.6. The notion of a pivot was given a central role in influential but 
largely unpublished work by G. A. Barnard. The basis of Example 2.9 and its 
further development is set out in books on multivariate analysis; see especially 
Anderson (2004). The starting point is that the vector mean $\bar{Y}$ has a multivariate
normal distribution of mean $\mu$ and covariance matrix $\Sigma_0/n$, a result proved by
generalizing the argument in Note 1.5. 

The Weak Law of Large Numbers is essentially that the average of a large 
number of independent (or not too strongly dependent) random variables is, 
under rather general conditions, with high probability close to the average of 
the expected values. It uses a particular definition of convergence, convergence 
in probability. The Strong Law, which forms a pinnacle of classical probability 
theory, uses a different notion of convergence which is, however, not relevant 
in the statistical problems treated here. See also Note 6.2. 

For the result of Example 2.9 and for many other developments the following
is needed. If $Z$ is a $p \times 1$ vector of zero mean its covariance matrix is
$\Sigma_Z = E(ZZ^T)$. If $c$ is an arbitrary $p \times p$ nonsingular matrix, $\Sigma_{cZ} = c\Sigma_Z c^T$
and $\Sigma_{cZ}^{-1} = (c^T)^{-1} \Sigma_Z^{-1} c^{-1}$. A direct calculation now shows that
$Z^T \Sigma_Z^{-1} Z = (cZ)^T \Sigma_{cZ}^{-1} (cZ)$, so that if we define
$\|Z\|_{\Sigma_Z}^2 = Z^T \Sigma_Z^{-1} Z$ its value is invariant
under nonsingular linear transformations. If, in particular, we choose $c$ so that
$\Sigma_{cZ} = I$, the identity matrix, the quadratic form reduces to the sum of squares
of the components, i.e., the Euclidean squared length of the vector. If the components
are normally distributed the distribution of the sum of squares is the
chi-squared distribution with $p$ degrees of freedom.
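
The invariance of the quadratic form under a nonsingular linear transformation can be checked numerically. The sketch below is restricted to $p = 2$ and uses small hand-written matrix helpers so that it needs no external library; the matrices, the vector and the helper names are hypothetical choices for illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def inv2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def quad_form(z, sigma):
    """Return z^T sigma^{-1} z for a 2-vector z given as a list."""
    col = [[z[0]], [z[1]]]
    return matmul(matmul(transpose(col), inv2(sigma)), col)[0][0]


# Hypothetical covariance matrix, transformation and vector.
sigma_z = [[2.0, 0.6], [0.6, 1.0]]
c = [[1.0, 2.0], [0.5, -1.0]]  # determinant -2, hence nonsingular
z = [0.7, -1.3]

cz = [row[0] for row in matmul(c, [[z[0]], [z[1]]])]
sigma_cz = matmul(matmul(c, sigma_z), transpose(c))  # covariance of cZ

print(quad_form(z, sigma_z), quad_form(cz, sigma_cz))  # equal up to rounding
```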

We may thus reasonably call $\|Z\|_{\Sigma_Z}$ the norm of the vector $Z$ in the metric
defined by its covariance matrix. It is in a sense the natural measure of length
attached to the ellipsoid defined by the covariance matrix.


Significance tests


Summary. First a number of distinct situations are given in which significance 
tests may be relevant. The nature of a simple significance test is set out and 
its implications explored. The relation with interval estimation is emphasized. 
While most of the discussion is from a frequentist perspective, relations with 
Bayesian theory are outlined in the final section. 


3.1 General remarks 


So far, in our frequentist discussion we have summarized information about the
unknown parameter $\psi$ by finding procedures that would give, in a long run of
hypothetical repeated applications, upper (or lower) bounds for $\psi$ a specified
proportion of times. This is close to but not the same
as specifying a probability distribution for $\psi$; it avoids having to treat $\psi$ as a
random variable, and moreover as one with a known distribution in the absence
of the data.

Suppose now there is specified a particular value $\psi_0$ of the parameter of
interest and we wish to assess the relation of the data to that value. Often
the hypothesis that $\psi = \psi_0$ is called the null hypothesis and conventionally
denoted by $H_0$. It may, for example, assert that some effect is zero or takes on
a value given by a theory or by previous studies, although $\psi_0$ does not have to
be restricted in that way.

There are at least six different situations in which this may arise, namely the 
following. 


• There may be some special reason for thinking that the null hypothesis
  may be exactly or approximately true or strong subject-matter interest may
  focus on establishing that it is likely to be false.
• There may be no special reason for thinking that the null hypothesis is true
  but it is important because it divides the parameter space into two (or more)
  regions with very different interpretations. We are then interested in
  whether the data establish reasonably clearly which region is correct,
  for example, they may establish the value of $\mathrm{sgn}(\psi - \psi_0)$.
• Testing may be a technical device used in the process of generating
  confidence intervals.
• Consistency with $\psi = \psi_0$ may provide a reasoned justification for
  simplifying an otherwise rather complicated model into one that is more
  transparent and which, initially at least, may be a reasonable basis for
  interpretation.
• Only the model when $\psi = \psi_0$ is under consideration as a possible model
  for interpreting the data and it has been embedded in a richer family just to
  provide a qualitative basis for assessing departure from the model.
• Only a single model is defined, but there is a qualitative idea of the kinds of
  departure that are of potential subject-matter interest.


The last two formulations are appropriate in particular for examining model 
adequacy. 

From time to time in the discussion it is useful to use the short-hand description
of $H_0$ as being possibly true. Now in statistical terms $H_0$ refers to a
probability model and the very word 'model' implies idealization. With a very
few possible exceptions it would be absurd to think that a mathematical model
is an exact representation of a real system and in that sense all $H_0$ are defined
within a system which is untrue. We use the term to mean that in the current state
of knowledge it is reasonable to proceed as if the hypothesis is true. Note that 
an underlying subject-matter hypothesis such as that a certain environmental 
exposure has absolutely no effect on a particular disease outcome might indeed 
be true. 


3.2 Simple significance test 


In the formulation of a simple significance test, we suppose available data $y$ and
a null hypothesis $H_0$ that specifies the distribution of the corresponding random
variable $Y$. In the first place, no other probabilistic specification is involved,
although some notion of the type of departure from $H_0$ that is of subject-matter
concern is essential.

The first step in testing $H_0$ is to find a distribution for observed random
variables that has a form which, under $H_0$, is free of nuisance parameters, i.e., is
completely known. This is trivial when there is a single unknown parameter
whose value is precisely specified by the null hypothesis. Next find or determine
a test statistic $T$, large (or extreme) values of which indicate a departure from
the null hypothesis of subject-matter interest. Then if $t_{\mathrm{obs}}$ is the observed value
of $T$ we define

$$p_{\mathrm{obs}} = P(T \ge t_{\mathrm{obs}}), \tag{3.1}$$

the probability being evaluated under $H_0$, to be the (observed) p-value of
the test.

It is conventional in many fields to report only very approximate values of
$p_{\mathrm{obs}}$, for example that the departure from $H_0$ is significant just past the 1 per cent
level, etc.

The hypothetical frequency interpretation of such reported significance levels
is as follows. If we were to accept the available data as just decisive evidence
against $H_0$, then we would reject the hypothesis when true a long-run proportion
$p_{\mathrm{obs}}$ of times.

Put more qualitatively, we examine consistency with $H_0$ by finding the consequences
of $H_0$, in this case a random variable with a known distribution,
and seeing whether the prediction about its observed value is reasonably well
fulfilled.

We deal first with a very special case involving testing a null hypothesis that 
might be true to a close approximation. 


Example 3.1. Test of a Poisson mean. Suppose that $Y$ has a Poisson distribution
of unknown mean $\mu$ and that it is required to test the null hypothesis $\mu = \mu_0$,
where $\mu_0$ is a value specified either by theory or a large amount of previous
experience. Suppose also that only departures in the direction of larger values
of $\mu$ are of interest. Here there is no ambiguity about the choice of test statistic;
it has to be $Y$ or a monotone function of $Y$ and given that $Y = y$ the p-value is

$$p_{\mathrm{obs}} = \sum_{v=y}^{\infty} e^{-\mu_0} \mu_0^v / v!. \tag{3.2}$$

Now suppose that instead of a single observation we have $n$ independent
replicate observations so that the model is that $Y_1, \ldots, Y_n$ have independent
Poisson distributions all with mean $\mu$. With the same null hypothesis as before,
there are now many possible test statistics that might be used, for example
$\max(Y_k)$. A preference for sufficient statistics leads, however, to the use of
$\sum Y_k$, which under the null hypothesis has a Poisson distribution of mean $n\mu_0$.
Then the p-value is again given by (3.2), now with $\mu_0$ replaced by $n\mu_0$ and with
$y$ replaced by the observed value of $\sum Y_k$.
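
The reduction to the sufficient statistic can be sketched as follows: the upper-tail Poisson probability is evaluated once for a single observation and once for the total of $n$ replicates, whose null mean is $n\mu_0$. The function names and the counts are hypothetical. The tail is computed as one minus a finite sum, which is adequate for moderate p-values but loses relative accuracy for extremely small ones.

```python
from math import exp


def poisson_upper_tail(y, mean):
    """P(Y >= y) for Y Poisson with the given mean, as 1 - P(Y <= y - 1)."""
    term = exp(-mean)  # P(Y = 0)
    below = 0.0
    for v in range(y):
        below += term
        term *= mean / (v + 1)  # P(Y = v + 1) from P(Y = v)
    return 1.0 - below


def replicate_p_value(ys, mu0):
    """p-value based on the total of n independent Poisson counts."""
    return poisson_upper_tail(sum(ys), len(ys) * mu0)


# Hypothetical counts, for illustration only.
print(poisson_upper_tail(5, 2.0))             # single observation y = 5
print(replicate_p_value([3, 1, 4, 2], 2.0))   # total 10, null mean 8
```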


We return to this illustration in Example 3.3. The testing of a null hypothesis
about the mean $\mu$ of a normal distribution when the standard deviation $\sigma_0$ is
known follows the same route. The distribution of the sample mean $\bar{Y}$ under
the null hypothesis $\mu = \mu_0$ is now given by an integral rather than by a sum
and (3.2) is replaced by

$$p_{\mathrm{obs}} = 1 - \Phi\left(\frac{\bar{y} - \mu_0}{\sigma_0/\sqrt{n}}\right). \tag{3.3}$$
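
A minimal sketch of the normal-theory p-value, one minus the standard normal distribution function evaluated at $(\bar{y} - \mu_0)/(\sigma_0/\sqrt{n})$, is given below. The function name and the numerical inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_mean_p_value(ybar, mu0, sigma0, n):
    """One-sided p-value for mu = mu0 against larger means, sigma0 known."""
    z = (ybar - mu0) / (sigma0 / sqrt(n))
    return 1.0 - NormalDist().cdf(z)


# Hypothetical summary values: sample mean 10.784 from n = 25 observations.
print(normal_mean_p_value(10.784, mu0=10.0, sigma0=2.0, n=25))
```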

We now turn to a complementary use of these ideas, namely to test the
adequacy of a given model, sometimes also called model criticism.
We illustrate this by testing the adequacy of the Poisson model. It is necessary
if we are to parallel the previous argument to find a statistic whose distribution is
exactly or very nearly independent of the unknown parameter $\mu$. An important
way of doing this is by appeal to the second property of sufficient statistics, 
namely that after conditioning on their observed value the remaining data have 
a fixed distribution. 


Example 3.2. Adequacy of Poisson model. Let $Y_1, \ldots, Y_n$ be independent
Poisson variables with unknown mean $\mu$. The null hypothesis $H_0$ for testing
model adequacy is that this model applies for some unknown $\mu$. Initially no
alternative is explicitly formulated. The sufficient statistic is $\sum Y_k$, so that to
assess consistency with the model we examine the conditional distribution of
the data given $\sum Y_k = s$. This density is zero if $\sum y_k \ne s$ and is otherwise

$$\frac{s!}{\prod y_k!} \frac{1}{n^s}, \tag{3.4}$$

i.e., is a multinomial distribution with $s$ trials each giving a response equally
likely to fall in one of $n$ cells. Because this distribution is completely specified
numerically, we are essentially in the same situation as if testing consistency
with a null hypothesis that completely specified the distribution of the observations
free of unknown parameters. There remains, except when $n = 2$, the need
to choose a test statistic. This is usually taken to be either the dispersion index
$\sum (Y_k - \bar{Y})^2/\bar{Y}$ or the number of zeros.

The former is equivalent to the ratio of the sample estimate of variance to the
mean. In this conditional specification, because the sample total is fixed, the
statistic is equivalent also to $\sum Y_k^2$. Note that if, for example, the dispersion test
is used, no explicit family of alternative models has been specified, only an
indication of the kind of discrepancy that it is especially important to detect.
A more formal and fully parametric procedure might have considered the negative
binomial distribution as representing such departures and then used the
apparatus of the Neyman–Pearson theory of testing hypotheses to develop a test
especially sensitive to such departures.
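
The conditional argument can be sketched by simulation: given the total $s$, counts are drawn from the equiprobable multinomial distribution and the dispersion index is recomputed each time. Using Monte Carlo rather than exact enumeration is a choice made here for brevity, not a method prescribed by the text; the function names, the seed and the data are hypothetical.

```python
import random


def dispersion_index(ys):
    """Sum of squared deviations from the mean, divided by the mean."""
    ybar = sum(ys) / len(ys)
    return sum((y - ybar) ** 2 for y in ys) / ybar


def conditional_dispersion_p_value(ys, n_sim=10000, seed=0):
    """Monte Carlo p-value for large dispersion, conditionally on the total."""
    n, s = len(ys), sum(ys)
    if s == 0:
        raise ValueError("the total count must be positive")
    rng = random.Random(seed)  # fixed seed for reproducibility
    t_obs = dispersion_index(ys)
    hits = 0
    for _ in range(n_sim):
        counts = [0] * n
        for _ in range(s):  # s trials, each equally likely to fall in any cell
            counts[rng.randrange(n)] += 1
        if dispersion_index(counts) >= t_obs - 1e-12:
            hits += 1
    return hits / n_sim


# Hypothetical counts showing some apparent overdispersion.
print(conditional_dispersion_p_value([0, 7, 1, 0, 9, 2, 0, 5]))
```

Because the total is fixed in every simulated sample, ranking samples by the dispersion index is the same as ranking them by the sum of squared counts.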

A quite high proportion of the more elementary tests used in applications were
developed by the relatively informal route just outlined. When a full family of
distributions is specified, covering both the null hypothesis and one or more
alternatives representing important departures from $H_0$, it is natural to base the
test on the optimal statistic for inference about $\psi$ within that family. This typically
has sensitivity properties in making the random variable $P$ corresponding
to $p_{\mathrm{obs}}$ stochastically small under alternative hypotheses.

For continuously distributed test statistics, $p_{\mathrm{obs}}$ typically may take any real
value in (0, 1). In the discrete case, however, only a discrete set of values of $p$
are achievable in any particular case. Because preassigned values such as 0.05 
play no special role, the only difficulty in interpretation is the theoretical one 
of comparing alternative procedures with different sets of achievable values. 


Example 3.3. More on the Poisson distribution. For continuous observations
the random variable $P$ can, as already noted, in principle take any value in (0, 1).
Suppose, however, we return to the special case that $Y$ has a Poisson distribution
with mean $\mu$ and that the null hypothesis $\mu = \mu_0$ is to be tested checking for
departures in which $\mu > \mu_0$, or more generally in which the observed random
variable $Y$ is stochastically larger than a Poisson-distributed random variable
of mean $\mu_0$. Then for a given observation $y$ the p-value is

$$p_{\mathrm{obs}}^+ = \sum_{v=y}^{\infty} e^{-\mu_0} \mu_0^v / v!, \tag{3.5}$$

whereas for detecting departures in the direction of small values the corresponding
p-value is

$$p_{\mathrm{obs}}^- = \sum_{v=0}^{y} e^{-\mu_0} \mu_0^v / v!. \tag{3.6}$$

Table 3.1 shows some values for the special case $\mu_0 = 2$. So far as use
for a one-sided significance test is concerned, the restriction to a particular
set of values is unimportant unless that set is in some sense embarrassingly
small. Thus the conclusion from Table 3.1(b) that in testing $\mu = 2$ looking for
departures $\mu < 2$ even the most extreme observation possible, namely zero,
does not have a particularly small p-value is hardly surprising. We return to the
implications for two-sided testing in the next section.

In forming upper confidence limits for $\mu$ based on an observed $y$ there is no
difficulty in finding critical values of $\mu$ such that the relevant lower tail area is
some assigned value. Thus with $y = 0$ the upper 0.95 point for $\mu$ is such that
$e^{-\mu^*} = 0.05$, i.e., $\mu^* = \log 20 \approx 3$. A similar calculation for a lower confidence
limit is not possible for $y = 0$, but is possible for all other $y$. The discreteness
of the set of achievable p-values is in this context largely unimportant.


Table 3.1. Achievable significance levels for testing that a Poisson-distributed
random variable with observed value $y$ has mean 2: (a) test against alternatives
larger than 2; (b) test against alternatives less than 2

(a)  y    2      3      4      5      6
     p    0.594  0.323  0.143  0.053  0.017

(b)  y    0      1      2
     p    0.135  0.406  0.677
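
The entries of Table 3.1 and the upper limit $\log 20$ for an observed zero can be reproduced directly from the Poisson probabilities. The following is a short sketch with hypothetical function names.

```python
from math import exp, factorial, log


def poisson_pmf(v, mean):
    return exp(-mean) * mean ** v / factorial(v)


def upper_tail(y, mean):
    """P(Y >= y), the p-value against alternatives with larger mean."""
    return 1.0 - sum(poisson_pmf(v, mean) for v in range(y))


def lower_tail(y, mean):
    """P(Y <= y), the p-value against alternatives with smaller mean."""
    return sum(poisson_pmf(v, mean) for v in range(y + 1))


mu0 = 2.0
part_a = [round(upper_tail(y, mu0), 3) for y in range(2, 7)]
part_b = [round(lower_tail(y, mu0), 3) for y in range(0, 3)]
print("(a)", part_a)
print("(b)", part_b)

# With y = 0 the lower tail area is exp(-mu); setting it to 0.05 gives
# the upper 0.95 limit log 20.
upper_limit_y0 = log(20.0)
print(round(upper_limit_y0, 3))
```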


3.3 One- and two-sided tests 


In many situations observed values of the test statistic in either tail of its distribution
represent interpretable, although typically different, departures from $H_0$.
The simplest procedure is then often to contemplate two tests, one for each tail,
in effect taking the more significant, i.e., the smaller tail, as the basis for possible
interpretation. Operational interpretation of the result as a hypothetical
error rate is achieved by doubling the corresponding $p$, with a slightly more
complicated argument in the discrete case.

More explicitly we argue as follows. With test statistic $T$, consider two
p-values, namely

$$p_{\mathrm{obs}}^+ = P(T \ge t; H_0), \qquad p_{\mathrm{obs}}^- = P(T \le t; H_0). \tag{3.7}$$

In general the sum of these values is $1 + P(T = t)$. In the two-sided case it is
then reasonable to define a new test statistic

$$Q = \min(P_{\mathrm{obs}}^+, P_{\mathrm{obs}}^-). \tag{3.8}$$

The level of significance is

$$P(Q \le q_{\mathrm{obs}}; H_0). \tag{3.9}$$

In the continuous case this is $2q_{\mathrm{obs}}$ because two disjoint events are involved.
In a discrete problem it is $q_{\mathrm{obs}}$ plus the achievable p-value from the other tail
of the distribution nearest to but not exceeding $q_{\mathrm{obs}}$. As has been stressed the
precise calculation of levels of significance is rarely if ever critical, so that
the careful definition is more one of principle than of pressing applied importance.
A more important point is that the definition is unaffected by a monotone
transformation of $T$.

In one sense very many applications of tests are essentially two-sided in 
that, even though initial interest may be in departures in one direction, it will
rarely be wise to disregard totally departures in the other direction, even if initially
they are unexpected. The interpretation of differences in the two directions
may well be very different. Thus in the broad class of procedures associated 
with the linear model of Example 1.4 tests are sometimes based on the ratio 
of an estimated variance, expected to be large if real systematic effects are 
present, to an estimate essentially of error. A large ratio indicates the presence of 
systematic effects whereas a suspiciously small ratio suggests an inadequately 
specified model structure. 


3.4 Relation with acceptance and rejection 


There is a conceptual difference, but essentially no mathematical difference,
between the discussion here and the treatment of testing as a two-decision problem,
with control over the formal error probabilities. In this we fix in principle
the probability of rejecting $H_0$ when it is true, usually denoted by $\alpha$, aiming to
maximize the probability of rejecting $H_0$ when false. This approach demands
the explicit formulation of alternative possibilities. Essentially it amounts to
setting in advance a threshold for $p_{\mathrm{obs}}$. It is, of course, potentially appropriate
when clear decisions are to be made, as for example in some classification problems.
The previous discussion seems to match more closely scientific practice
in these matters, at least for those situations where analysis and interpretation
rather than decision-making are the focus.

That is, there is a distinction between the Neyman–Pearson formulation of
testing regarded as clarifying the meaning of statistical significance via hypothetical
repetitions and that same theory regarded as in effect an instruction on
how to implement the ideas by choosing a suitable $\alpha$ in advance and reaching
different decisions accordingly. The interpretation to be attached to accepting 
or rejecting a hypothesis is strongly context-dependent; the point at stake here, 
however, is more a question of the distinction between assessing evidence, as 
contrasted with deciding by a formal rule which of two directions to take. 


3.5 Formulation of alternatives and test statistics 


As set out above, the simplest version of a significance test involves formulation
of a null hypothesis $H_0$ and a test statistic $T$, large, or possibly extreme,
values of which point against $H_0$. Choice of $T$ is crucial in specifying the kinds
of departure from $H_0$ of concern. In this first formulation no alternative probability
models are explicitly formulated; an implicit family of possibilities is
specified via $T$. In fact many quite widely used statistical tests were developed in
this way.

A second possibility is that the null hypothesis corresponds to a particular
parameter value, say $\psi = \psi_0$, in a family of models and the departures of main
interest correspond either, in the one-dimensional case, to one-sided alternatives
$\psi > \psi_0$ or, more generally, to alternatives $\psi \ne \psi_0$. This formulation will
suggest the most sensitive test statistic, essentially equivalent to the best estimate
of $\psi$, and in the Neyman–Pearson formulation such an explicit formulation of
alternatives is essential.

The approaches are, however, not quite as disparate as they may seem.
Let $f_0(y)$ denote the density of the observations under $H_0$. Then we may
associate with a proposed test statistic $T$ the exponential family

$$f_0(y) \exp\{t\theta - k(\theta)\}, \tag{3.10}$$

where $k(\theta)$ is a normalizing constant. Then the test of $\theta = 0$ most sensitive to
these departures is based on $T$. Not all useful tests appear natural when viewed
in this way, however; see, for instance, Example 3.5.

Many of the test procedures for examining model adequacy that are provided 
in standard software are best regarded as defined directly by the test statistic 
used rather than by a family of alternatives. In principle, as emphasized above, 
the null hypothesis is the conditional distribution of the data given the sufficient 
statistic for the parameters in the model. Then, within that null distribution, 
interesting directions of departure are identified. 

The important distinction is between situations in which a whole family of 
distributions arises naturally as a base for analysis versus those where analysis 
is at a stage where only the null hypothesis is of explicit interest. 

Tests where the null hypothesis itself is formulated in terms of arbitrary distributions,
so-called nonparametric or distribution-free tests, illustrate the use of
test statistics that are formulated largely or wholly informally, without specific
probabilistically formulated alternatives in mind. To illustrate the arguments
involved, consider initially a single homogeneous set of observations.

That is, let $Y_1, \ldots, Y_n$ be independent and identically distributed random
variables with arbitrary cumulative distribution function $F(y) = P(Y_k \le y)$.
To avoid minor complications we suppose throughout the following discussion
that the distribution is continuous so that, in particular, the possibility of ties,
i.e., exactly equal pairs of observations, can be ignored. A formal likelihood
can be obtained by dividing the real line into a very large number of very small
intervals each having an arbitrary probability attached to it. The likelihood for
data $y$ can then be specified in a number of ways, namely by

• a list of data values actually observed,
• the sample cumulative distribution function, defined as

$$F_n(y) = n^{-1} \sum I(y_k \le y), \tag{3.11}$$

  where the indicator function $I(y_k \le y)$ is one if $y_k \le y$ and zero otherwise,
• the set of order statistics $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$, i.e., the observed values
  arranged in increasing order.


The second and third of these are reductions of the first, suppressing the 
information about the order in which the observations are obtained. In general 
no further reduction is possible and so either of the last two forms the suf- 
ficient statistic. Thus if we apply the general prescription, conclusions about 
the function F(y) are to be based on one of the above, for example on the 
sample distribution function, whereas consistency with the model is examined 
via the conditional distribution given the order statistics. This conditional dis- 
tribution specifies that starting from the original data, all n! permutations of the 
data are equally likely. This leads to tests in general termed permutation tests. 
This idea is now applied to slightly more complicated situations. 


Example 3.4. Test of symmetry. Suppose that the null hypothesis is that the
distribution is symmetric about a known point, which we may take to be zero.
That is, under the null hypothesis $F(-y) + F(y) = 1$. Under this hypothesis,
all points $y$ and $-y$ have equal probability, so that the sufficient statistic is
determined by the order statistics or sample distribution function of the $|y_k|$.
Further, conditionally on the sufficient statistic, all $2^n$ sample points $\pm y_k$ have
equal probability $2^{-n}$. Thus the distribution under the null hypothesis of any
test statistic is in principle exactly known.

Simple one-sided test statistics for symmetry can be based on the number
of positive observations, leading to the sign test, whose null distribution is
binomial with parameter $\frac{1}{2}$ and index $n$, or on the mean of all observations.
The distribution of the latter can be found by enumeration or approximated by 
finding its first few moments and using a normal or other approximation to the 
distribution. 
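
Both statistics can be sketched for small $n$ by direct enumeration of the $2^n$ equally likely sign patterns. The code below assumes no zero observations and no ties, in line with the continuity assumption; the function names and the data are hypothetical.

```python
from itertools import product
from math import comb


def sign_test_p_value(ys):
    """P(number of positives >= observed) under the binomial(n, 1/2) null."""
    n = len(ys)
    k = sum(1 for y in ys if y > 0)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


def sign_flip_mean_p_value(ys):
    """One-sided p-value for the mean, enumerating all 2^n sign patterns."""
    abs_vals = [abs(y) for y in ys]
    t_obs = sum(ys)  # the total is equivalent to the mean for fixed n
    count = 0
    for signs in product((1, -1), repeat=len(ys)):
        total = sum(s * a for s, a in zip(signs, abs_vals))
        if total >= t_obs - 1e-12:
            count += 1
    return count / 2 ** len(ys)


# Hypothetical differences, e.g. after minus before some intervention.
diffs = [1.2, 0.4, 2.1, -0.3, 0.9]
print(sign_test_p_value(diffs))       # 4 positives out of 5
print(sign_flip_mean_p_value(diffs))
```

Enumeration is feasible only for small $n$; for larger samples the normal approximation based on the first few moments would be the natural substitute.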

This formulation is relevant, for example, when the observations analyzed are 
differences between primary observations after and before some intervention, or 
are differences obtained from the same individual under two different regimes. 

A strength of such procedures is that they involve no specification of the 
functional form of F(y). They do, however, involve strong independence 
assumptions, which may often be more critical. Moreover, they do not extend 
easily to relatively complicated models. 
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Example 3.5. Nonparametric two-sample test. Let (Y\1,...,Yin,; Y21,---. 
Y2n,) be two sets of mutually independent random variables with cumulative dis- 
tribution functions respectively F\(y) and F2(y). Consider the null hypothesis 
Fı (y) = F2(y) for all y. 

When this hypothesis is true the sufficient statistic is the set of order statistics 
of the combined set of observations and under this hypothesis all (nı + n2)! 
permutations of the data are equally likely and, in particular, the first set of nı 
observations is in effect a random sample drawn without replacement from the 
full set, allowing the null distribution of any test statistic to be found. 

Sometimes it may be considered that while the ordering of possible obser- 
vational points is meaningful the labelling by numerical values is not. Then 
we look for procedures invariant under arbitrary strictly monotone increasing 
transformations of the measurement scale. This is achieved by replacing the 
individual observations y by their rank order in the full set of order statistics of 
the combined sample. If the test statistic, $T_W$, is the sum of the ranks of, say,
the first sample, the resulting test is called the Wilcoxon rank sum test and the
parallel test for the single-sample symmetry problem is the Wilcoxon signed
rank test.

The distribution of the test statistic under the null hypothesis, and hence the
level of significance, is in principle found by enumeration. The moments of $T_W$
under the null hypothesis can be found by the arguments to be developed in
a rather different context in Section 9.2 and in fact the mean and variance are
respectively $n_1(n_1 + n_2 + 1)/2$ and $n_1 n_2 (n_1 + n_2 + 1)/12$. A normal
approximation, with continuity correction, based on this mean and variance will often
be adequate.
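The stated mean and variance can be checked for small samples by enumerating every way of choosing the ranks of the first sample. This is a sketch only, with a hypothetical function name, and it assumes no ties so that the ranks are 1, ..., $n_1 + n_2$.

```python
from itertools import combinations


def rank_sum_null(n1, n2):
    """Exact null mean and variance of the rank sum of the first sample."""
    ranks = range(1, n1 + n2 + 1)
    # Under the null hypothesis every subset of n1 ranks is equally likely.
    sums = [sum(subset) for subset in combinations(ranks, n1)]
    mean = sum(sums) / len(sums)
    variance = sum((s - mean) ** 2 for s in sums) / len(sums)
    return mean, variance


n1, n2 = 3, 4
exact_mean, exact_var = rank_sum_null(n1, n2)
formula_mean = n1 * (n1 + n2 + 1) / 2
formula_var = n1 * n2 * (n1 + n2 + 1) / 12
```

The same list of enumerated sums gives the exact level of significance as the proportion of subsets with rank sum at least as extreme as that observed.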

Throughout this discussion the full set of values of y is regarded as fixed. 


The choice of test statistic in these arguments is based on informal consider- 
ations or broad analogy. Sometimes, however, the choice can be sharpened 
by requiring good sensitivity of the test were the data produced by some 
unknown monotonic transformation of data following a specific parametric 
model. 

For the two-sample problem, the most obvious possibilities are that the data 
are transformed from underlying normal or underlying exponential distribu- 
tions, a test of equality of the relevant means being required in each case. The 
exponential model is potentially relevant for the analysis of survival data. Up to 
a scale and location change in the normal case and up to a scale change in 
the exponential case, the originating data can then be reconstructed approx- 
imately under the null hypothesis by replacing the rth largest order statistic 
out of n in the data by the expected value of that order statistic in samples
from the standard normal or unit exponential distribution respectively. Then
the standard parametric test statistics are used, relying in principle on their 
permutation distribution to preserve the nonparametric propriety of the test. 

It can be shown that, purely as a test procedure, the loss of sensitivity of
the resulting nonparametric analysis as compared with the fully parametric
analysis, were it available, is usually small. In the normal case the expected order
statistics, called Fisher and Yates scores, are tabulated, or can be approximated
by $\Phi^{-1}\{(r - 3/8)/(n + 1/4)\}$. For the exponential distribution the scores can
be given in explicit form or approximated by $\log\{(n + 1)/(n + 1 - r - 1/2)\}$.


3.6 Relation with interval estimation 


While it may seem conceptually simplest to regard estimation with uncertainty
as a more direct mode of analysis than significance testing, there are
some important advantages, especially in dealing with relatively complicated
problems, in arguing in the other direction. Essentially confidence intervals, or
more generally confidence sets, can be produced by testing consistency with
every possible value in $\Omega_\psi$ and taking all those values not ‘rejected’ at level $c$,
say, to produce a $1 - c$ level interval or region. This procedure has the property
that in repeated applications any true value of $\psi$ will be included in the region
except in a proportion $c$ of cases. This can be done at various levels $c$, using
the same form of test throughout.


Example 3.6. Ratio of normal means. Given two independent sets of random
variables, of sizes $n_0$ and $n_1$, from normal distributions of unknown means
$\mu_0, \mu_1$ and with known variance $\sigma_0^2$, we first reduce by sufficiency to the
sample means $\bar{y}_0, \bar{y}_1$. Suppose that the parameter of interest is
$\psi = \mu_1/\mu_0$. Consider the null hypothesis $\psi = \psi_0$. Then we look for a
statistic with a distribution under the null hypothesis that does not depend on the
nuisance parameter. Such a statistic is

$$\frac{\bar{Y}_1 - \psi_0 \bar{Y}_0}{\sigma_0 \sqrt{(1/n_1 + \psi_0^2/n_0)}}; \tag{3.12}$$

this has a standard normal distribution under the null hypothesis. This with
$\psi_0$ replaced by $\psi$ could be treated as a pivot provided that we can treat $\bar{Y}_0$ as
positive.

Note that provided the two distributions have the same variance a similar
result with the Student t distribution replacing the standard normal would apply
if the variance were unknown and had to be estimated. To treat the probably
more realistic situation where the two distributions have different and unknown
variances requires the approximate techniques of Chapter 6.

We now form a $1 - c$ level confidence region by taking all those values of $\psi_0$
that would not be ‘rejected’ at level $c$ in this test. That is, we take the set

$$\left\{\psi : \frac{(\bar{y}_1 - \psi \bar{y}_0)^2}{\sigma_0^2 (1/n_1 + \psi^2/n_0)} \le k^*_{1;c}\right\}, \tag{3.13}$$

where $k^*_{1;c}$ is the upper $c$ point of the chi-squared distribution with one degree
of freedom.

Thus we find the limits for $\psi$ as the roots of a quadratic equation. If there
are no real roots, all values of $\psi$ are consistent with the data at the level in
question. If the numerator and especially the denominator are poorly determined,
a confidence interval consisting of the whole line may be the only rational
conclusion to be drawn and is entirely reasonable from a testing point of view,
even though regarded from a confidence interval perspective it may, wrongly,
seem like a vacuous statement.
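The quadratic can be written out explicitly by multiplying the inequality defining the confidence set through by its denominator. The sketch below does this under stated assumptions: the function name and return convention are hypothetical, the upper $c$ point of chi-squared with one degree of freedom is obtained as the square of a standard normal quantile, and the degenerate case of a vanishing leading coefficient is not handled. When the leading coefficient is negative but the roots are real, the set is the complement of an interval, a case the discussion above does not single out.

```python
from math import sqrt
from statistics import NormalDist


def fieller_region(ybar0, ybar1, n0, n1, sigma0, c):
    """Set of psi with (ybar1 - psi*ybar0)^2 <= k * sigma0^2 * (1/n1 + psi^2/n0)."""
    k = NormalDist().inv_cdf(1 - c / 2) ** 2  # upper c point of chi-squared, 1 d.f.
    # Rearranged as qa*psi^2 + qb*psi + qc <= 0.
    qa = ybar0 ** 2 - k * sigma0 ** 2 / n0
    qb = -2 * ybar0 * ybar1
    qc = ybar1 ** 2 - k * sigma0 ** 2 / n1
    if qa == 0:
        raise ValueError("degenerate case not handled in this sketch")
    disc = qb * qb - 4 * qa * qc
    if disc < 0:
        # No real roots: every value of psi is consistent with the data.
        return ("whole line", None, None)
    lo, hi = sorted(((-qb - sqrt(disc)) / (2 * qa), (-qb + sqrt(disc)) / (2 * qa)))
    if qa > 0:
        return ("interval", lo, hi)      # lo <= psi <= hi
    return ("complement", lo, hi)        # psi <= lo or psi >= hi


# Denominator mean well determined: an ordinary interval around 5/10.
well_determined = fieller_region(10.0, 5.0, 25, 25, 1.0, 0.05)
# Both means close to zero relative to their standard errors.
poorly_determined = fieller_region(0.1, 0.1, 25, 25, 1.0, 0.05)
```

The leading coefficient is positive exactly when $\bar{y}_0$ differs significantly from zero at level $c$, which is the sense in which the denominator must be well determined for a finite interval to result.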


Depending on the context, emphasis may lie on the possible explanations 
of the data that are reasonably consistent with the data or on those possible 
explanations that have been reasonably firmly refuted. 


Example 3.7. Poisson-distributed signal with additive noise. Suppose that $Y$
has a Poisson distribution with mean $\mu + a$, where $a > 0$ is a known constant
representing a background process of noise whose rate of occurrence has
been estimated with high precision in a separate study. The parameter $\mu$
corresponds to a signal of interest. Now if, for example, $y = 0$ and $a$ is appreciable,
for example, $a > 4$, when we test consistency with each possible value of $\mu$
all values of the parameter are inconsistent with the data until we use very small
values of $c$. For example, the 95 per cent confidence interval will be empty. Now
in terms of the initial formulation of confidence intervals, in which, in particular,
the model is taken as a firm basis for analysis, this amounts to making
a statement that is certainly wrong; there is by supposition some value of the
parameter that generated the data. On the other hand, regarded as a statement of
which values of $\mu$ are consistent with the data at such-and-such a level the
statement is perfectly reasonable and indeed is arguably the only sensible frequentist
conclusion possible at that level of $c$.
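The empty interval can be exhibited by testing each candidate value of the signal on a grid. The sketch below makes several assumptions not fixed by the example: small counts are taken as evidence against large values of the signal, so the p-value is the lower tail probability $P(Y \le y)$; the grid spacing and upper limit are arbitrary; and the function names are hypothetical.

```python
from math import exp


def lower_tail_p(y, mean):
    """P(Y <= y) for Y Poisson with the given mean."""
    term = exp(-mean)
    total = term
    for k in range(1, y + 1):
        term *= mean / k
        total += term
    return total


def consistent_signals(y, a, c, step=0.01, mu_max=20.0):
    """Grid values of the signal mu >= 0 not 'rejected' at level c."""
    n_steps = int(round(mu_max / step))
    return [i * step for i in range(n_steps + 1)
            if lower_tail_p(y, i * step + a) >= c]


# With y = 0 and background a = 4.5, even mu = 0 gives p = exp(-4.5) < 0.05.
at_5_percent = consistent_signals(0, 4.5, 0.05)
# Only at a much smaller level do any values of mu survive.
at_01_percent = consistent_signals(0, 4.5, 0.001)
```

For $y = 0$ the p-value is $\exp\{-(\mu + a)\}$, so the surviving values are those with $\mu \le -\log c - a$, which is a nonempty range only when $c \le e^{-a}$.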


3.7 Interpretation of significance tests 


There is a large and ever-increasing literature on the use and misuse of 
significance tests. This centres on such points as: 


1. Often the null hypothesis is almost certainly false, inviting the question
why is it worth testing it?

2. Estimation of $\psi$ is usually more enlightening than testing hypotheses
about $\psi$.

3. Failure to ‘reject’ $H_0$ does not mean that we necessarily consider $H_0$ to be
exactly or even nearly true.

4. If tests show that data are consistent with $H_0$ and inconsistent with the
minimal departures from $H_0$ considered as of subject-matter importance,
then this may be taken as positive support for $H_0$, i.e., as more than mere
consistency with $H_0$.

5. With large amounts of data small departures from $H_0$ of no subject-matter
importance may be highly significant.

6. When there are several or many somewhat similar sets of data from
different sources bearing on the same issue, separate significance
tests for each data source on its own are usually best avoided. They
address individual questions in isolation and this is often
inappropriate.

7. $p_{\mathrm{obs}}$ is not the probability that $H_0$ is true.


Discussion of these points would take us too far afield. Point 7 addresses 
a clear misconception. The other points are largely concerned with how in 
applications such tests are most fruitfully applied and with the close connection 
between tests and interval estimation. The latter theme is emphasized below. 
The essential point is that significance tests in the first place address the ques- 
tion of whether the data are reasonably consistent with a null hypothesis in 
the respect tested. This is in many contexts an interesting but limited ques- 
tion. The much fuller specification needed to develop confidence limits by this 
route leads to much more informative summaries of what the data plus model 
assumptions imply. 


3.8 Bayesian testing 


A Bayesian discussion of significance testing is available only when a full family
of models is available. We work with the posterior distribution of $\psi$. When the
null hypothesis is quite possibly exactly or nearly correct we specify a prior
probability $\pi_0$ that $H_0$ is true; we need also to specify the conditional prior
distribution of $\psi$ when $H_0$ is false, as well as aspects of the prior distribution
concerning nuisance parameters $\lambda$. Some care is needed here because the issue
of testing is not likely to arise when massive easily detected differences are
present. Thus when, say, $\psi$ can be estimated with a standard error of $\sigma_0/\sqrt{n}$
the conditional prior should have standard deviation $b\sigma_0/\sqrt{n}$, for some not too
large value of $b$.
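To make the role of $\pi_0$ and $b$ concrete, the posterior probability of the null hypothesis can be computed in the simplest setting. This is a sketch under assumptions that go beyond the text: there are no nuisance parameters, the estimate of $\psi$ is normally distributed with standard error $\sigma_0/\sqrt{n}$, and the conditional prior when $H_0$ is false is normal and centred on the null value. The function name is hypothetical, and $z$ denotes the estimate minus the null value, divided by its standard error.

```python
from math import exp, sqrt


def posterior_prob_null(z, b, pi0):
    """Posterior probability of H0 given the standardized estimate z.

    Under H0, z ~ N(0, 1); under the assumed alternative, integrating over
    the conditional prior gives z ~ N(0, 1 + b^2).
    """
    # Ratio of the two normal densities at z (Bayes factor in favour of H0).
    bayes_factor = sqrt(1 + b * b) * exp(-0.5 * z * z * b * b / (1 + b * b))
    posterior_odds = pi0 / (1 - pi0) * bayes_factor
    return posterior_odds / (1 + posterior_odds)


p_at_null = posterior_prob_null(z=0.0, b=1.0, pi0=0.5)   # data exactly at the null
p_far_away = posterior_prob_null(z=3.0, b=3.0, pi0=0.5)  # three standard errors away
```

For fixed $z$ the factor $\sqrt{1 + b^2}$ grows without limit as $b$ increases, which is one way of seeing why $b$ should not be taken too large.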

When the role of $H_0$ is to divide the parameter space into qualitatively
different parts the discussion essentially is equivalent to checking whether
the posterior interval at some suitable level overlaps the null hypothesis value
of $\psi$. If and only if there is no overlap the region containing $\psi$ is reasonably
firmly established. In simple situations, such as that of Example 1.1,
posterior and confidence intervals are in exact or approximate agreement
when flat priors are used, providing in such problems some formal justification
for the use of flat priors or, from a different perspective, for confidence
intervals.

We defer to Section 5.12 the general principles that apply to the choice of 
prior distributions, in particular as they affect both types of testing problems 
mentioned here. 


Notes 3 


Section 3.1. The explicit classification of types of null hypothesis is developed 
from Cox (1977). 


Section 3.2. The use of the conditional distribution to test conformity with
a Poisson distribution follows Fisher (1950). Another route for dealing with
discrete distributions is to define (Stone, 1969) the p-value for test statistic $T$
by $P(T > t_{\mathrm{obs}}) + P(T = t_{\mathrm{obs}})/2$. This produces a statistic having more nearly
a uniform distribution under the null hypothesis but the motivating operational
meaning has been sacrificed.


Section 3.3. There are a number of ways of defining two-sided p-values for 
discrete distributions; see, for example, Cox and Hinkley (1974, p.79). 


Section 3.4. The contrast made here between the calculation of p-values as
measures of evidence of consistency and the more decision-focused emphasis
on accepting and rejecting hypotheses might be taken as one characteristic
difference between the Fisherian and the Neyman–Pearson formulations of
statistical theory. While this is in some respects the case, the actual practice in
specific applications as between Fisher and Neyman was almost the reverse.
Neyman often in effect reported p-values whereas some of Fisher’s use of tests
in applications was much more dichotomous. For a discussion of the notion
of severity of tests, and the circumstances when consistency with $H_0$ might be
taken as positive support for $H_0$, see Mayo (1996).


Section 3.5. For a thorough account of nonparametric tests, see Lehmann 
(1998). 


Section 3.6. The argument for the ratio of normal means is due to E. C. Fieller, 
after whom it is commonly named. The result applies immediately to the ratio 
of least squares regression coefficients and hence in particular to estimating 
the intercept of a regression line on the z-coordinate axis. A substantial dispute 
developed over how this problem should be handled from the point of view of 
Fisher’s fiducial theory, which mathematically but not conceptually amounts 
to putting flat priors on the two means (Creasy, 1954). There are important 
extensions of the situation, for example to inverse regression or (controlled) 
calibration, where on the basis of a fitted regression equation it is desired to 
estimate the value of an explanatory variable that generated a new value of the 
response variable. 


Section 3.8. For more on Bayesian tests, see Jeffreys (1961) and also 
Section 6.2.6 and Notes 6.2. 


4 


More complicated situations 


Summary. This chapter continues the comparative discussion of frequentist 
and Bayesian arguments by examining rather more complicated situations. In 
particular several versions of the two-by-two contingency table are compared 
and further developments indicated. More complicated Bayesian problems are 
discussed. 


4.1 General remarks 


The previous frequentist discussion, especially in Chapter 3, yields a theoretical
approach which is limited in two senses. It is restricted to problems with no 
nuisance parameters or ones in which elimination of nuisance parameters is 
straightforward. An important step in generalizing the discussion is to extend 
the notion of a Fisherian reduction. Then we turn to a more systematic discussion 
of the role of nuisance parameters. 

By comparison, as noted previously in Section 1.5, a great formal advant- 
age of the Bayesian formulation is that, once the formulation is accepted, all 
subsequent problems are computational and the simplifications consequent on 
sufficiency serve only to ease calculations. 


4.2 General Bayesian formulation 


The argument outlined in Section 1.5 for inference about the mean of a normal
distribution can be generalized as follows. Consider the model $f_{Y\mid\Theta}(y \mid \theta)$,
where, because we are going to treat the unknown parameter as a random variable,
we now regard the model for the data-generating process as a conditional
density. Suppose that $\Theta$ has the prior density $f_\Theta(\theta)$, specifying the marginal
distribution of the parameter, i.e., in effect the distribution $\Theta$ has when the
observations $y$ are not available.

Given the data and the above formulation it is reasonable to assume that
all information about $\theta$ is contained in the conditional distribution of $\Theta$ given
$Y = y$. We call this the posterior distribution of $\Theta$ and calculate it by the
standard laws of probability theory, as given in (1.12), by

$$f_{\Theta\mid Y}(\theta \mid y) = \frac{f_{Y\mid\Theta}(y \mid \theta) f_\Theta(\theta)}{\int_{\Omega_\theta} f_{Y\mid\Theta}(y \mid \phi) f_\Theta(\phi)\, d\phi}. \tag{4.1}$$

The main problem in computing this lies in evaluating the normalizing constant
in the denominator, especially if the dimension of the parameter space is high.

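Finally, to isolate the information about the parameter of interest $\psi$, we
marginalize the posterior distribution over the nuisance parameter $\lambda$. That is,
writing $\theta = (\psi, \lambda)$, we consider

$$f_{\Psi\mid Y}(\psi \mid y) = \int f_{\Theta\mid Y}(\psi, \lambda \mid y)\, d\lambda. \tag{4.2}$$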


The models and parameters for which this leads to simple explicit solutions are 
broadly those for which frequentist inference yields simple solutions. 
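When no explicit solution is available, the two steps just described, normalization and then marginalization over the nuisance parameter, can be imitated numerically in low dimensions. The following sketch is illustrative only: it assumes independent normal observations with mean $\psi$ and standard deviation $\lambda$, equally spaced grids for both parameters so that sums stand in for integrals, and a flat prior by default. The function names and data are invented.

```python
import math


def grid_posterior(y, psi_grid, lam_grid, log_prior=lambda psi, lam: 0.0):
    """Normalized posterior weights on a (psi, lam) grid for y_i ~ N(psi, lam^2)."""
    log_post = [
        [log_prior(p, s)
         + sum(-math.log(s) - 0.5 * ((yi - p) / s) ** 2 for yi in y)
         for s in lam_grid]
        for p in psi_grid
    ]
    top = max(max(row) for row in log_post)  # subtract the maximum for stability
    weights = [[math.exp(v - top) for v in row] for row in log_post]
    total = sum(sum(row) for row in weights)  # grid analogue of the normalizing constant
    return [[v / total for v in row] for row in weights]


def marginal_psi(weights):
    """Sum the joint weights over the nuisance parameter for each psi."""
    return [sum(row) for row in weights]


y = [4.8, 5.1, 5.3, 4.9, 5.4]                      # invented data
psi_grid = [4 + 0.05 * i for i in range(41)]       # 4.0 to 6.0
lam_grid = [0.1 + 0.05 * j for j in range(40)]     # 0.1 to 2.05
marginal = marginal_psi(grid_posterior(y, psi_grid, lam_grid))
```

The number of grid points grows exponentially with the number of parameters, which is the practical form of the difficulty with the normalizing constant in high dimensions.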

Because in the formula for the posterior density the prior density enters both
in the numerator and the denominator, formal multiplication of the prior density
by a constant would leave the answer unchanged. That is, there is no need for
the prior measure to be normalized to 1 so that, formally at least, improper, i.e.
divergent, prior densities may be used, always provided proper posterior densities
result. A simple example in the context of the normal mean, Section 1.5,
is to take as prior density element $f_M(\mu)\,d\mu$ just $d\mu$. This could be regarded
as the limit of the proper normal prior with variance $v$ taken as $v \to \infty$. Such
limits raise few problems in simple cases, but in complicated multiparameter
problems considerable care would be needed were such limiting notions
contemplated. There results here a posterior distribution for the mean that is normal
with mean $\bar{y}$ and variance $\sigma_0^2/n$, leading to posterior limits numerically identical
to confidence limits.
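The limiting operation can be followed numerically with the standard conjugate result for a normal mean with known variance, in which precisions add and the posterior mean is a precision-weighted average. The sketch assumes that result, which is the one referred to in Section 1.5; the function name and the numerical values are invented.

```python
def normal_posterior(ybar, n, sigma0_sq, prior_mean, prior_var):
    """Posterior mean and variance of a normal mean under a normal prior."""
    precision = n / sigma0_sq + 1 / prior_var
    mean = (n * ybar / sigma0_sq + prior_mean / prior_var) / precision
    return mean, 1 / precision


ybar, n, sigma0_sq = 2.0, 10, 4.0
# As the prior variance v grows, the posterior approaches N(ybar, sigma0^2 / n).
for v in (1.0, 100.0, 1e12):
    post_mean, post_var = normal_posterior(ybar, n, sigma0_sq, prior_mean=0.0, prior_var=v)
```

With the largest prior variance used here the posterior mean and variance agree with $\bar{y}$ and $\sigma_0^2/n$ to within rounding, whatever prior mean is chosen.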

In fact, with a scalar parameter it is possible in some generality to find a prior 
giving very close agreement with corresponding confidence intervals. With 
multidimensional parameters this is not in general possible and naive use of 
flat priors can lead to procedures that are very poor from all perspectives; see 
Example 5.5. 

An immediate consequence of the Bayesian formulation is that for a given 
prior distribution and model the data enter only via the likelihood. More spe- 
cifically for a given prior and model if y and y’ are two sets of values with 
proportional likelihoods the induced posterior distributions are the same. This
forms what may be called the weak likelihood principle. Less immediately,
if two different models but with parameters having the same interpretation, 
but possibly even referring to different observational systems, lead to data y 
and y’ with proportional likelihoods, then again the posterior distributions are 
identical. This forms the less compelling strong likelihood principle. Most fre- 
quentist methods do not obey this latter principle, although the departure is 
usually relatively minor. If the models refer to different random systems, the 
implicit prior knowledge may, in any case, be different. 
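A familiar illustration of the strong form, not taken from the text, contrasts a fixed number of binary trials with sampling continued until a fixed number of successes. Three successes in 12 trials give likelihoods proportional to $\theta^3 (1 - \theta)^9$ under both schemes, so any common prior yields the same posterior. The sketch evaluates both posteriors on a grid; the function name, the uniform prior and the grid are assumptions made for the example.

```python
from math import comb


def normalized_posterior(likelihood, prior, grid):
    """Posterior weights on a grid of parameter values, normalized to sum to 1."""
    weights = [likelihood(t) * prior(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]


grid = [(i + 0.5) / 200 for i in range(200)]


def flat_prior(theta):
    return 1.0


def binomial_lik(theta):
    """Number of trials fixed at 12; 3 successes observed."""
    return comb(12, 3) * theta ** 3 * (1 - theta) ** 9


def negative_binomial_lik(theta):
    """Sampling continued until the 3rd success, which occurred on trial 12."""
    return comb(11, 2) * theta ** 3 * (1 - theta) ** 9


post_fixed_n = normalized_posterior(binomial_lik, flat_prior, grid)
post_fixed_r = normalized_posterior(negative_binomial_lik, flat_prior, grid)
max_difference = max(abs(u - v) for u, v in zip(post_fixed_n, post_fixed_r))
```

The differing combinatorial constants cancel on normalization, whereas frequentist tail areas under the two sampling schemes are computed over different sample spaces and so need not coincide.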


4.3 Frequentist analysis 


4.3.1 Extended Fisherian reduction 


One approach to simple problems is essentially that of Section 2.5 and can be 
summarized, as before, in the Fisherian reduction: 


find the likelihood function;

reduce to a sufficient statistic $S$ of the same dimension as $\theta$;

find a function of $S$ that has a distribution depending only on $\psi$;

place it in pivotal form or alternatively use it to derive p-values for null
hypotheses;

invert to obtain limits for $\psi$ at an arbitrary set of probability levels.


There is sometimes an extension of the method that works when the model 
is of the (k,d) curved exponential family form. Then the sufficient statistic is 
of dimension k greater than d, the dimension of the parameter space. We then 
proceed as follows: 


if possible, rewrite the k-dimensional sufficient statistic, when $k > d$, in the
form $(S, A)$ such that $S$ is of dimension $d$ and $A$ has a distribution not
depending on $\theta$;

consider the distribution of $S$ given $A = a$ and proceed as before. The
statistic $A$ is called ancillary.


There are limitations to these methods. In particular a suitable A may not 
exist, and then one is driven to asymptotic, i.e., approximate, arguments for 
problems of reasonable complexity and sometimes even for simple problems. 

We give some examples, the first of which is not of exponential family form. 


Example 4.1. Uniform distribution of known range. Suppose that $(Y_1, \ldots, Y_n)$
are independently and identically distributed in the uniform distribution over
$(\theta - 1, \theta + 1)$. The likelihood takes the constant value $2^{-n}$ provided the smallest
and largest values $(y_{(1)}, y_{(n)})$ lie within the range $(\theta - 1, \theta + 1)$ and is zero
otherwise. The minimal sufficient statistic is of dimension 2, even though the
parameter is only of dimension 1. The model is a special case of a location family
and it follows from the invariance properties of such models that $A = Y_{(n)} - Y_{(1)}$
has a distribution independent of $\theta$.

This example shows the imperative of explicit or implicit conditioning on
the observed value $a$ of $A$ in quite compelling form. If $a$ is approximately 2,
only values of $\theta$ very close to $y^* = (y_{(1)} + y_{(n)})/2$ are consistent with the data.
If, on the other hand, $a$ is very small, all values in the range of the common
observed value $y^*$ plus and minus 1 are consistent with the data. In general, the
conditional distribution of $Y^*$ given $A = a$ is found as follows.

The joint density of $(Y_{(1)}, Y_{(n)})$ is

$$n(n - 1)(y_{(n)} - y_{(1)})^{n-2}/2^n \tag{4.3}$$

and the transformation to new variables $(Y^*, A = Y_{(n)} - Y_{(1)})$ has unit Jacobian.
Therefore the new variables $(Y^*, A)$ have density $n(n - 1)a^{n-2}/2^n$ defined over
the triangular region $(0 < a < 2;\ \theta - 1 + a/2 < y^* < \theta + 1 - a/2)$ and density
zero elsewhere. This implies that the conditional density of $Y^*$ given $A = a$ is
uniform over the allowable interval $\theta - 1 + a/2 < y^* < \theta + 1 - a/2$.

Conditional confidence interval statements can now be constructed although 
they add little to the statement just made, in effect that every value of 0 in the 
relevant interval is in some sense equally consistent with the data. The key point 
is that an interval statement assessed by its unconditional distribution could be 
formed that would give the correct marginal frequency of coverage but that 
would hide the fact that for some samples very precise statements are possible 
whereas for others only low precision is achievable. 
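Since $Y^*$ given $A = a$ is uniform on an interval of half-length $1 - a/2$ centred on $\theta$, a conditional interval of level $1 - c$ is $y^*$ plus and minus $(1 - c)(1 - a/2)$. The sketch below computes it for two invented samples; the function name is hypothetical and the symmetric form of the interval is one choice among several with the same conditional coverage.

```python
def conditional_interval(y, c):
    """Conditional 1 - c interval for theta given the observed range a."""
    smallest, largest = min(y), max(y)
    a = largest - smallest              # observed value of the ancillary statistic
    y_star = (smallest + largest) / 2   # midrange
    half_range = 1 - a / 2              # theta must lie within this of y_star
    half_width = (1 - c) * half_range
    return y_star - half_width, y_star + half_width, a


# Range close to 2: theta is pinned down very precisely.
wide_sample = conditional_interval([0.1, 1.0, 1.9], c=0.05)
# Small range: only a low-precision statement is possible.
narrow_sample = conditional_interval([0.9, 1.0, 1.1], c=0.05)
```

Both samples have the same size and the same midrange, yet the conditional intervals differ in length by a factor of nine, which is the information an unconditional assessment would conceal.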


Example 4.2. Two measuring instruments. A closely related point is made
by the following idealized example. Suppose that a single observation $Y$ is
made on a normally distributed random variable of unknown mean $\mu$. There
are available two measuring instruments, one with known small variance, say
$\sigma_0^2 = 10^{-4}$, and one with known large variance, say $\sigma_1^2 = 10^4$. A randomizing
device chooses an instrument with probability 1/2 for each possibility and the
full data consist of the observation $y$ and an identifier $d = 0, 1$ to show which
variance is involved. The log likelihood is

$$-\log \sigma_d - (y - \mu)^2/(2\sigma_d^2), \tag{4.4}$$

so that $(y, d)$ forms the sufficient statistic and $d$ is ancillary, suggesting again that
the formation of confidence intervals or the evaluation of a p-value should use
the variance belonging to the apparatus actually used. If the sensitive apparatus,
$d = 0$, is in fact used, why should the interpretation be affected by the possibility
that one might have used some other instrument?
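A short calculation shows what goes wrong if the instrument identifier is ignored. The sketch finds the half-width $k$ for which the interval $y \pm k$ has coverage 0.95 averaged over the two instruments, and then evaluates its coverage for each instrument separately. The function names are hypothetical and bisection is used only for simplicity.

```python
from statistics import NormalDist


def coverage(k, sd):
    """P(|Y - mu| <= k) for a normal observation with standard deviation sd."""
    return 2 * NormalDist().cdf(k / sd) - 1


def marginal_half_width(level, sd0, sd1):
    """Half-width k giving the stated coverage averaged over both instruments."""
    lo, hi = 0.0, 10 * max(sd0, sd1)
    for _ in range(200):  # bisection on the increasing function of k
        mid = (lo + hi) / 2
        if 0.5 * coverage(mid, sd0) + 0.5 * coverage(mid, sd1) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


sd0, sd1 = 1e-2, 1e2                      # square roots of the two variances
k = marginal_half_width(0.95, sd0, sd1)   # roughly 164.5
coverage_precise = coverage(k, sd0)       # essentially 1 when d = 0
coverage_imprecise = coverage(k, sd1)     # about 0.90 when d = 1
conditional_half_widths = (NormalDist().inv_cdf(0.975) * sd0,
                           NormalDist().inv_cdf(0.975) * sd1)
```

The unconditional interval is absurdly wide when the precise instrument was used and undercovers when the imprecise one was used, whereas the conditional half-widths, about 0.02 and 196, reflect the precision actually achieved.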

There is a distinction between this and the previous example in that in the 
former the conditioning arose out of the mathematical structure of the problem, 
whereas in the present example the ancillary statistic arises from a physical 
distinction between two measuring devices. 


There is a further important point suggested by this example. The fact that the
randomizing probability is assumed known does not seem material. The argument
for conditioning is equally compelling if the choice between the two sets
of apparatus is made at random with some unknown probability, provided only
that the value of the probability, and of course the outcome of the randomization,
is unconnected with $\mu$.

More formally, suppose that we have factorization in which the distribution
of $S$ given $A = a$ depends only on $\theta$, whereas the distribution of $A$ depends on
an additional parameter $\gamma$ such that the parameter space becomes $\Omega_\theta \times \Omega_\gamma$,
so that $\theta$ and $\gamma$ are variation-independent in the sense introduced in Section
1.1. Then $A$ is ancillary in the extended sense for inference about $\theta$. The term
S-ancillarity is often used for this idea.


Example 4.3. Linear model. The previous examples are in a sense a preliminary
to the following widely occurring situation. Consider the linear model of
Examples 1.4 and 2.2 in which the $n \times 1$ vector $Y$ has expectation $E(Y) = z\beta$,
where $z$ is an $n \times q$ matrix of full rank $q < n$ and where the components are
independently normally distributed with variance $\sigma^2$. Suppose that, instead of
the assumption of the previous discussion that $z$ is a matrix of fixed constants,
we suppose that $z$ is the realized value of a random matrix $Z$ with a known
probability distribution.

The log likelihood is by (2.17)

$$-n \log \sigma - \{(y - z\hat{\beta})^T (y - z\hat{\beta}) + (\hat{\beta} - \beta)^T (z^T z)(\hat{\beta} - \beta)\}/(2\sigma^2) \tag{4.5}$$

plus in general functions of $z$ arising from the known distribution of $Z$. Thus
the minimal sufficient statistic includes the residual sum of squares and the
least squares estimates as before but also functions of $z$, in particular $z^T z$. Thus
conditioning on $z$, or especially on $z^T z$, is indicated. This matrix, which specifies
the precision of the least squares estimates, plays the role of the distinction
between the two measuring instruments in the preceding example.

As noted in the previous discussion and using the extended definition
of ancillarity, the argument for conditioning is unaltered if $Z$ has a probability
distribution $f_Z(z; \gamma)$, where $\gamma$ and the parameter of interest are
variation-independent.

In many experimental studies the explanatory variables would be chosen
by the investigator systematically and treating $z$ as a fixed matrix is totally
appropriate. In observational studies in which study individuals are chosen in a
random way all variables are random and so modelling $z$ as a random variable
might seem natural. The discussion of extended ancillarity shows when that is
unnecessary and the standard practice of treating the explanatory variables as
fixed is then justified. This does not mean that the distribution of the explanatory
variables is totally uninformative about what is taken as the primary focus
of concern, namely the dependence of $Y$ on $z$. In addition to specifying via
the matrix $(z^T z)^{-1}$ the precision of the least squares estimates, comparison of
the distribution of the components of $z$ with their distribution in some target
population may provide evidence of the reliability of the sampling procedure
used and of the security of any extrapolation of the conclusions. In a comparable
Bayesian analysis the corresponding assumption would be the stronger one that
the prior densities of $\theta$ and $\gamma$ are independent.


4.4 Some more general frequentist developments 


4.4.1 Exponential family problems 


Some limited but important classes of problems have a formally exact solution
within the frequentist approach. We begin with two situations involving
exponential family distributions. For simplicity we suppose that the parameter $\psi$ of
interest is one-dimensional.

We start with a full exponential family model in which $\psi$ is a linear combination
of components of the canonical parameter. After linear transformation of
the parameter and canonical statistic we can write the density of the observations
in the form

$$m(y) \exp\{s_\psi \psi + s_\lambda^T \lambda - k(\psi, \lambda)\}, \tag{4.6}$$

where $(s_\psi, s_\lambda)$ is a partitioning of the sufficient statistic corresponding to the
partitioning of the parameter. For inference about $\psi$ we prefer a pivotal
distribution not depending on $\lambda$. It can be shown via the mathematical property of
completeness, essentially that the parameter space is rich enough, that separation
from $\lambda$ can be achieved only by working with the conditional distribution
of the data given $S_\lambda = s_\lambda$. That is, we evaluate the conditional distribution of
$S_\psi$ given $S_\lambda = s_\lambda$ and use the resulting distribution to find which values of $\psi$
are consistent with the data at various levels.

Many of the standard procedures dealing with simple problems about
Poisson, binomial, gamma and normal distributions can be found by this route.

We illustrate the arguments sketched above by a number of problems connected
with binomial distributions, noting that the canonical parameter of a
binomial distribution with probability $\pi$ is $\phi = \log\{\pi/(1 - \pi)\}$. This has some
advantages in interpretation and also disadvantages. When $\pi$ is small, the parameter
is essentially equivalent to $\log \pi$, meaning that differences in canonical
parameter are equivalent to ratios of probabilities.


Example 4.4. Two-by-two contingency table. Here there are data on n indi- 
viduals for each of which two binary features are available, for example blue or 
nonblue eyes, dead or survived, and so on. Throughout random variables refer- 
ring to different individuals are assumed to be mutually independent. While 
this is in some respects a very simple form of data, there are in fact at least four 
distinct models for this situation. Three are discussed in the present section and 
one, based on randomization used in design, is deferred to Chapter 9. 

These models lead, to some extent, to the same statistical procedures, at least 
for examining the presence of a relation between the binary features, although 
derived by slightly different arguments. From an interpretational point of view, 
however, the models are different. The data are in the form of four frequencies, 
summing to n, and arranged in a 2 x 2 array. 

In the first place all the $R_{kl}$ in Table 4.1 are random variables with observed
values $r_{kl}$. In some of the later developments one or more marginal frequencies
are regarded as fixed and we shall adapt the notation to reflect that.

It helps to motivate the following discussion to consider some ways in which
such data might arise. First, suppose that $R_{kl}$ represents the number of occurrences
of randomly occurring events in a population over a specified period
and that $k = 0, 1$ specifies the gender of the individual and $l = 0, 1$ specifies
presence or absence of a risk factor for the event. Then a reasonable model
is that the $R_{kl}$ are independent random variables with Poisson distributions. It
follows that also the total frequency $R_{..}$ has a Poisson distribution.

Next, suppose that the rows and columns correspond to two separate features
of individuals to be treated on an equal footing, say eye colour and hair colour,
both regarded as binary for simplicity. Suppose that $n$ independent individuals
are chosen and the two features recorded for each. Then the frequencies in
Table 4.1 have a multinomial distribution with four cells.

Table 4.1. General notation for $2 \times 2$ contingency table

$R_{00}$    $R_{01}$    $R_{0.}$
$R_{10}$    $R_{11}$    $R_{1.}$

$R_{.0}$    $R_{.1}$    $R_{..} = n$

Finally, suppose that the rows of the table correspond to two treatments and
that $n_{k.}$ individuals are allocated to treatment $k$ for $k = 0, 1$ and a binary outcome
variable, failure or success, say, recorded. Then $R_{kl}$ is the number of responses
$l$ among the individuals allocated to treatment $k$.

In the first situation the objective is to compare the Poisson means, in the 
second to examine possible association between the features and in the third to 
compare two probabilities of success. The essential equivalence of the analytical 
procedures in these situations arises out of successive processes of conditioning; 
the interpretations, concerning as they do different specifications, are different. 

In the first model the $R_{kl}$ are independent random variables with Poisson
distributions of means $\mu_{kl}$ and canonical parameters $\phi_{kl} = \log \mu_{kl}$. A possible
hypothesis of interest is that the Poisson means have multiplicative form, i.e.,
the canonical parameter is a row effect plus a column effect. To test departures
from this model involves the parameter

$$\psi = \log \mu_{11} - \log \mu_{10} - \log \mu_{01} + \log \mu_{00} \tag{4.7}$$

which is zero in the multiplicative case. The log likelihood may be written in
exponential family form and the canonical parameters, $\log \mu_{kl}$, may be replaced
by $\psi$ and three nuisance parameters. For inference about $\psi$ via a distribution
not depending on the nuisance parameters, it is necessary to condition on the
marginal totals $(R_{0.}, R_{1.}, R_{.0}, R_{.1})$, implying also conditioning on the total number
of observations $R_{..}$. The full conditional distribution will be given later in
this discussion.

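Note first, however, that the marginal distribution of $R_{..}$ is a Poisson
distribution of mean $\Sigma \mu_{kl}$. Therefore the conditional distribution of the $R_{kl}$ given
$R_{..} = n$ is

$$\frac{\prod \exp(-\mu_{kl})\, \mu_{kl}^{r_{kl}} / r_{kl}!}{\exp(-\Sigma \mu_{kl}) (\Sigma \mu_{kl})^n / n!}, \tag{4.8}$$

so that the conditional distribution has the multinomial form

$$\frac{n!}{\prod r_{kl}!} \prod \pi_{kl}^{r_{kl}}, \tag{4.9}$$

for $r_{..} = \Sigma r_{kl} = n$, say, and is zero otherwise. Here

$$\pi_{kl} = \mu_{kl} / \Sigma \mu_{st}. \tag{4.10}$$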

Further the standard cross-product ratio measure of association (711700)/ 
(719701) is equal to e”. The parameter space is now three-dimensional in the 
light of the constraint Ery = 1. 
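
The reduction from independent Poisson counts to a multinomial distribution can
be checked numerically. The following is a minimal sketch, not part of the
original argument: the means and counts are hypothetical values chosen for
illustration, and the function names are ours.

```python
import math

def poisson_pmf(r, mu):
    """Poisson probability of count r with mean mu."""
    return math.exp(-mu) * mu**r / math.factorial(r)

def conditional_given_total(counts, means):
    """Ratio (4.8): joint Poisson probability of the cells divided by
    the Poisson probability of their total."""
    joint = math.prod(poisson_pmf(r, mu) for r, mu in zip(counts, means))
    return joint / poisson_pmf(sum(counts), sum(means))

def multinomial_pmf(counts, means):
    """Multinomial form (4.9) with cell probabilities (4.10)."""
    total = sum(means)
    coef = math.factorial(sum(counts))
    for r in counts:
        coef //= math.factorial(r)  # exact: each partial quotient is an integer
    return coef * math.prod((mu / total) ** r for r, mu in zip(counts, means))

# Hypothetical means and counts, cells ordered (00, 01, 10, 11).
means = [2.0, 3.5, 1.2, 4.3]
counts = [1, 4, 0, 3]

lhs = conditional_given_total(counts, means)
rhs = multinomial_pmf(counts, means)

# psi from (4.7) and the cross-product ratio of the multinomial probabilities.
psi = math.log(means[3]) - math.log(means[2]) - math.log(means[1]) + math.log(means[0])
probs = [mu / sum(means) for mu in means]
cross_ratio = probs[3] * probs[0] / (probs[2] * probs[1])
```

Under these assumptions `lhs` and `rhs` agree to rounding error, and
`cross_ratio` equals `math.exp(psi)`, because the common factor $\sum \mu_{st}$
cancels in the cross-product ratio.
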

Now suppose as a second possibility that a sample of preassigned size $n$ is
taken and interest focuses on the joint distribution of two binary features treated
on an equal footing. The multinomial distribution just calculated would now be
the initial model for analysis and $\psi$ would be a suitable measure of association.
Examination of the likelihood again shows that a distribution depending only
on $\psi$ is obtained by conditioning on the row and column totals.

Further, if we condition first on the row totals the resulting distribution
corresponds to two independent binomial distributions with effective sample
sizes $n_{0.} = r_{0.}$ and $n_{1.} = r_{1.}$. The corresponding binomial probabilities are
$\pi_{k.} = \pi_{k1}/(\pi_{k0} + \pi_{k1})$ for $k = 0, 1$. The corresponding canonical parameters are

$$\log\{\pi_{k.}/(1 - \pi_{k.})\} = \log(\pi_{k1}/\pi_{k0}). \tag{4.11}$$


Thus the difference of canonical parameters is $\psi$.

In the preceding formulation there is a sample of size $n$ in which two binary
features are recorded and treated on an equal footing. Now suppose that instead
the rows of the table represent an explanatory feature, such as alternative
treatments, and that samples of size $n_{0.} = r_{0.}$ and $n_{1.} = r_{1.}$, say, are taken from
the relevant binomial distributions, which have corresponding probabilities of
outcome 1 equal to $\pi_{0.}$ and $\pi_{1.}$. The likelihood is the product of two binomial
terms


$$\prod_k \frac{n_{k.}!}{r_{k0}!\, r_{k1}!} \pi_{k.}^{r_{k1}} (1 - \pi_{k.})^{r_{k0}}. \tag{4.12}$$


On writing this in exponential family form and noting, as above, that $\psi$ is
a difference of canonical parameters, we have that a distribution for inference
about $\psi$ is obtained by conditioning further on the column totals $R_{.0}$ and $R_{.1}$, so
that the distribution is that of, say, $R_{11}$ given the column totals, the row totals
already being fixed by design. That is, conditioning is now on the full set $M$
of margins, namely $R_{.0}, R_{.1}, R_{0.}, R_{1.}$.
This conditional distribution can be written

$$P(R_{11} = r_{11} \mid M) = \frac{C^{r_{1.}}_{r_{11}} C^{r_{0.}}_{r_{.1} - r_{11}} e^{r_{11}\psi}}{\sum_w C^{r_{1.}}_{w} C^{r_{0.}}_{r_{.1} - w} e^{w\psi}}. \tag{4.13}$$

Here $C^t_s$ denotes the number of combinations of $t$ items taken $s$ at a time, namely
$t!/\{s!(t - s)!\}$. The distribution is a generalized hypergeometric distribution,
reducing to an ordinary hypergeometric distribution in the special case $\psi = 0$,
when the sum in the denominator becomes $C^{r_{..}}_{r_{.1}}$. The significance test based
on this is called Fisher's exact test. The mean and variance of the hypergeometric
distribution are respectively

$$\frac{r_{1.} r_{.1}}{n}, \qquad \frac{r_{0.} r_{1.} r_{.0} r_{.1}}{n^2 (n - 1)}. \tag{4.14}$$
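
As an illustration, the generalized hypergeometric distribution (4.13) and the
one-sided form of Fisher's exact test can be computed by direct enumeration.
This is a minimal sketch under stated assumptions: the table of counts is
hypothetical, the function names are ours, and only the upper-tail p-value at
$\psi = 0$ is computed.

```python
from math import comb, exp

def gen_hypergeom_pmf(r1_row, r0_row, col1, psi=0.0):
    """Return {w: P(R11 = w | margins)} following (4.13).

    r1_row, r0_row are the row totals r_{1.}, r_{0.}; col1 is the column total r_{.1}.
    """
    lo = max(0, col1 - r0_row)
    hi = min(r1_row, col1)
    weights = {
        w: comb(r1_row, w) * comb(r0_row, col1 - w) * exp(w * psi)
        for w in range(lo, hi + 1)
    }
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def fisher_upper_p(r11, r1_row, r0_row, col1):
    """Upper-tail p-value P(R11 >= r11 | margins) under psi = 0."""
    pmf = gen_hypergeom_pmf(r1_row, r0_row, col1, 0.0)
    return sum(p for w, p in pmf.items() if w >= r11)

# Hypothetical table: row 0 = (r00, r01) = (7, 3), row 1 = (r10, r11) = (2, 8).
r0_row, r1_row, col0, col1 = 10, 10, 9, 11
n = r0_row + r1_row

pmf0 = gen_hypergeom_pmf(r1_row, r0_row, col1)
mean0 = sum(w * p for w, p in pmf0.items())
var0 = sum((w - mean0) ** 2 * p for w, p in pmf0.items())
p_value = fisher_upper_p(8, r1_row, r0_row, col1)
```

Here `mean0` and `var0` can be compared with the two expressions in (4.14);
setting `psi` to a nonzero value gives the distribution needed for confidence
limits on $\psi$.
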

We thus have three different models respectively concerning Poisson
distributions, concerning association in a multinomial distribution and concerning the
comparison of binomial parameters that lead after conditioning to the same stat- 
istical procedure. It is helpful to distinguish conditioning by model formulation 
from technical conditioning. Thus in the comparison of two binomial para- 
meters the conditioning on row totals (sample sizes) is by model formulation, 
whereas the conditioning on column totals is technical to derive a distribution 
not depending on a nuisance parameter. 

A practical consequence of the above results is that computerized procedures 
for any of the models can be used interchangeably in this and related more 
general situations. Due care in interpretation is of course crucial. In particular 
methods for log Poisson regression, i.e., the analysis of multiplicative models 
for Poisson means, can be used much more widely than might seem at first 
sight. 

Later, in Example 9.1, we shall have a version in which all the conditioning 
is introduced by model formulation. 


Example 4.5. Mantel-Haenszel procedure. Suppose that there are m inde- 
pendent sets of binary data comparing the same two treatments and thus each 
producing a two-by-two table of the kind analyzed in Example 4.4 in the final 
formulation. If we constrain the probabilities in the tables only by requiring
that all have the same logistic difference $\psi$, then the canonical parameters in
the $k$th table can be written


$$\phi_{k0} = \alpha_k, \qquad \phi_{k1} = \alpha_k + \psi. \tag{4.15}$$


That is, the relevant probabilities of a response 1 in the two rows of the $k$th table
are, respectively,


$$e^{\alpha_k}/(1 + e^{\alpha_k}), \qquad e^{\alpha_k + \psi}/(1 + e^{\alpha_k + \psi}). \tag{4.16}$$


It follows from the discussion for a single table that for inference about $\psi$,
treating the $\alpha_k$ as arbitrary nuisance parameters, technical conditioning is needed on
the column totals of each table separately as well as on the separate row totals
which were fixed by design. Note that if further information were available
to constrain the $\alpha_k$ in some way, in particular to destroy variation
independence, the conditioning requirement would, in principle at least, be changed. The
component of the canonical statistic attached to $\psi$ is $\sum R_{k,11}$.

It follows that the distribution for inference about $\psi$ is the convolution of
$m$ generalized hypergeometric distributions (4.13). To test the null hypothesis
$\psi = 0$ the test statistic $\sum R_{k,11}$ can be used and has a distribution formed by
convoluting the separate hypergeometric distributions. In fact unless both $m$ and
the individual tables are in some sense small, a normal approximation based on
adding separate means and variances (4.14) will be satisfactory.
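
The normal approximation just described can be sketched as follows. The three
tables are hypothetical, the function names are ours, and no continuity
correction is applied; this is an illustration of adding the means and variances
(4.14) across tables, not a full implementation of the procedure.

```python
from math import erf, sqrt

def hypergeom_mean_var(r00, r01, r10, r11):
    """Null mean and variance (4.14) of R11 given the margins of one table."""
    row0, row1 = r00 + r01, r10 + r11
    col0, col1 = r00 + r10, r01 + r11
    n = row0 + row1
    mean = row1 * col1 / n
    var = row0 * row1 * col0 * col1 / (n**2 * (n - 1))
    return mean, var

def mantel_haenszel_z(tables):
    """Standardized value of sum_k R_{k,11} under psi = 0."""
    observed = sum(t[3] for t in tables)
    parts = [hypergeom_mean_var(*t) for t in tables]
    mean = sum(m for m, _ in parts)
    var = sum(v for _, v in parts)
    return (observed - mean) / sqrt(var)

def upper_tail_normal(z):
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Hypothetical data: each table is (r00, r01, r10, r11).
tables = [(7, 3, 2, 8), (12, 8, 9, 11), (4, 6, 3, 7)]
z = mantel_haenszel_z(tables)
p_value = upper_tail_normal(z)
```

Each table contributes through its own margins only, which reflects the
separate technical conditioning on the column totals of each table.
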

The conditional analysis would be identical if starting from a model in which 
all cell frequencies had independent Poisson distributions with means specified 
by a suitable product model. 


Example 4.6. Simple regression for binary data. Suppose that $Y_1, \ldots, Y_n$ are
independent binary random variables such that

$$P(Y_k = 1) = e^{\alpha + \beta z_k}/(1 + e^{\alpha + \beta z_k}), \qquad P(Y_k = 0) = 1/(1 + e^{\alpha + \beta z_k}), \tag{4.17}$$

where the notation of Example 1.2 has been used to emphasize the connection
with simple normal-theory linear regression. The log likelihood can be written
in the form


$$\alpha \sum y_k + \beta \sum y_k z_k - \sum \log(1 + e^{\alpha + \beta z_k}), \tag{4.18}$$


so that the sufficient statistic is $r = \sum y_k$, the total number of 1s, and $s = \sum y_k z_k$,
the sum of the $z_k$ over the individuals with outcome 1. The latter is conditionally
equivalent to the least squares regression coefficient of $Y$ on $z$.

For inference about $\beta$ we consider the conditional density of $S$ given $\sum Y_k = r$,
namely


$$c_r(s) e^{\beta s} / \sum_v c_r(v) e^{\beta v}, \tag{4.19}$$


where the combinatorial coefficient $c_r(s)$ can be interpreted as the number (or
proportion) of distinct possible samples of size $r$ drawn without replacement
from the finite population $\{z_1, \ldots, z_n\}$ and having total $s$.

In the particular case of testing the null hypothesis that $\beta = 0$ an exact
test of significance is obtained by enumeration of the possible samples or via
some approximation. This is a generalization of Fisher's exact test for a 2 × 2
contingency table, the special case where the $z_k$ are themselves binary.
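
For small $n$ the enumeration behind (4.19) is easy to carry out directly. The
sketch below is illustrative only: the values of $z$ and $y$ are hypothetical,
the function name is ours, and full enumeration of all samples of size $r$ is
feasible only for small problems.

```python
from collections import Counter
from itertools import combinations
from math import exp

def conditional_dist(z, r, beta=0.0):
    """Conditional distribution (4.19) of S given sum(Y) = r.

    c_r(s) is obtained by counting the samples of size r, drawn without
    replacement from z, that have total s.
    """
    counts = Counter(sum(sample) for sample in combinations(z, r))
    weights = {s: c * exp(beta * s) for s, c in counts.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Hypothetical explanatory values and binary outcomes.
z = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 1, 1]

r = sum(y)                                   # number of 1s
s_obs = sum(yk * zk for yk, zk in zip(y, z))  # sum of z over outcomes equal to 1
null_dist = conditional_dist(z, r)
p_upper = sum(p for s, p in null_dist.items() if s >= s_obs)
```

With these values only two of the twenty possible samples of size three have a
total at least as large as the observed one, so the one-sided p-value is 2/20.
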


The above examples have concerned inference about a linear combination 
of the canonical parameters, which can thus be taken as a single component of 
that parameter. A rather less elegant procedure deals with the ratio of canonical 
parameters. For this suppose without essential loss of generality that the
parameter of interest is $\psi = \phi_1/\phi_2$, the ratio of the first two canonical parameters.
Under a null hypothesis $\psi = \psi_0$ the likelihood is thus


$$\exp\{(s_1\psi_0 + s_2)\phi_2 + s_{[3]}^T \phi_{[3]} - k(\phi)\}, \tag{4.20}$$


where, for example, $\phi_{[3]} = (\phi_3, \phi_4, \ldots)$. It follows that in order to achieve
a distribution independent of the nuisance parameters technical conditioning
is required on $(s_1\psi_0 + s_2, s_{[3]})$. That is, to test for departures from the null
hypothesis leading to, say, large values of $S_1$, we consider the upper tail of its
conditional distribution.


Example 4.7. Normal mean, variance unknown. The canonical parameters for
a normal distribution of unknown mean $\mu$ and unknown variance $\sigma^2$ are $\mu/\sigma^2$
and $1/\sigma^2$. Thus inference about $\mu$ is concerned with the ratio of canonical
parameters. Under the null hypothesis $\mu = \mu_0$, the sufficient statistic for $\sigma^2$
is $\sum (Y_k - \mu_0)^2$. Under the null hypothesis the conditional distribution of the
vector $Y$ is uniform on the sphere defined by the conditioning statistic and for
the observed value $y$ the required p-value is the proportion of the surface area
of the sphere corresponding to a cap defined by $y$ and centred on the unit line
through $\psi_0$ (here $\psi_0 = \mu_0$). Some calculation, which will not be reproduced here, shows this
argument to be a rather tortuous route to the standard Student $t$ statistic and its
distribution.

A similar but slightly more complicated argument yields a significance test 
for a regression parameter in a normal-theory linear model. 


Example 4.8. Comparison of gamma distributions. Suppose that $Y_1, Y_2$ are
independently distributed with densities


$$\rho_k (\rho_k y_k)^{m_k - 1} \exp(-\rho_k y_k) / \Gamma(m_k) \tag{4.21}$$


for $k = 1, 2$ and that interest lies in $\psi = \rho_2/\rho_1$. This might arise in comparing
the rates of two Poisson processes or alternatively in considering the ratio of two
normal-theory estimates of variance having respectively $(2m_1, 2m_2)$ degrees of
freedom.

The general argument leads to consideration of the null hypothesis $\rho_2 = \psi_0 \rho_1$
and thence to study of the conditional distribution of $Y_1$ given $Y_1 + \psi_0 Y_2$. This
can be shown to be equivalent to using the pivot


$$(\rho_2 Y_2/m_2)/(\rho_1 Y_1/m_1) \tag{4.22}$$


having pivotal distribution the standard variance-ratio or $F$ distribution with
$(2m_2, 2m_1)$ degrees of freedom, the numerator degrees of freedom being listed first.
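
A quick Monte Carlo check of the pivotal property is sketched below. It is an
illustration under stated assumptions: the shape and rate values are
hypothetical, the function name is ours, and only the mean of the pivot is
examined. That mean should be close to $m_1/(m_1 - 1)$, the mean of a variance
ratio whose denominator has $2m_1$ degrees of freedom, whatever the values of
$\rho_1$ and $\rho_2$.

```python
import random

def simulate_pivot(m1, m2, rho1, rho2, reps, seed=0):
    """Draw values of the pivot (4.22) for gamma variables with
    shapes m1, m2 and rates rho1, rho2."""
    rng = random.Random(seed)
    values = []
    for _ in range(reps):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        y1 = rng.gammavariate(m1, 1.0 / rho1)
        y2 = rng.gammavariate(m2, 1.0 / rho2)
        values.append((rho2 * y2 / m2) / (rho1 * y1 / m1))
    return values

m1, m2 = 5, 3                      # hypothetical shape parameters
pivot_a = simulate_pivot(m1, m2, rho1=2.0, rho2=0.7, reps=100_000)
pivot_b = simulate_pivot(m1, m2, rho1=0.3, rho2=9.0, reps=100_000, seed=1)
mean_a = sum(pivot_a) / len(pivot_a)
mean_b = sum(pivot_b) / len(pivot_b)
theoretical_mean = m1 / (m1 - 1)   # mean of F with denominator df 2 * m1
```

The two runs use very different rates yet give essentially the same mean, as
expected of a pivot.
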


The reason that in the last two examples a rather round-about route is involved 
in reaching an apparently simple conclusion is that the exponential family takes 
no account of simplifications emerging from the transformational structure of 
scale and/or location intrinsic to these two problems. 


Example 4.9. Unacceptable conditioning. Suppose that $(Y_1, Y_2)$ have
independent Poisson distributions of means $(\mu_1, \mu_2)$ and hence canonical parameters
$(\log \mu_1, \log \mu_2)$. A null hypothesis about the ratio of canonical parameters,
namely $\log \mu_2 = \psi_0 \log \mu_1$, is


$$\mu_2 = \mu_1^{\psi_0}. \tag{4.23}$$


The previous argument requires conditioning the distribution of $Y_2$ on the
value of $y_2 + \psi_0 y_1$. Now especially if $\psi_0$ is irrational, or even if it is rational of
the form $a/b$, where $a$ and $b$ are large coprime integers, there may be very few
points of support in the conditional distribution in question. In an extreme case
the observed point may be unique. The argument via technical conditioning
may thus collapse; moreover small changes in $\psi_0$ induce dramatic changes
in the conditional distribution. Yet the statistical problem, even if somewhat
artificial, is not meaningless and is clearly capable of some approximate answer.
If the observed values are not too small the value of $\psi$ cannot be too far from
$\log y_2/\log y_1$ and the arguments of asymptotic theory to be developed later in
the book can be deployed to produce approximate tests and confidence intervals.
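
The collapse of the conditional support can be seen by listing the points
consistent with the observed value of $y_2 + \psi_0 y_1$. The sketch below assumes
a positive rational $\psi_0$, represented exactly as a `Fraction`; the observed
counts and the function name are hypothetical.

```python
from fractions import Fraction

def support_points(y1, y2, psi0):
    """Non-negative integer pairs (v1, v2) with v2 + psi0 * v1 equal to the
    observed y2 + psi0 * y1. psi0 must be a positive Fraction."""
    target = y2 + psi0 * y1
    points = []
    v1 = 0
    while psi0 * v1 <= target:
        v2 = target - psi0 * v1
        if v2.denominator == 1:      # v2 is a non-negative integer
            points.append((v1, int(v2)))
        v1 += 1
    return points

y1, y2 = 10, 12                                      # hypothetical observed counts
wide = support_points(y1, y2, Fraction(1))           # psi0 = 1
narrow = support_points(y1, y2, Fraction(101, 100))  # psi0 = 1.01
```

With $\psi_0 = 1$ the conditional distribution has 23 support points, whereas
the slightly different value $\psi_0 = 101/100$ leaves only the observed point,
illustrating the sensitivity to small changes in $\psi_0$.
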

The important implication of this example is that the use of technical con- 
ditioning to produce procedures whose properties do not depend on nuisance 
parameters is powerful and appealing, but that insistence on exact separation 
from the nuisance parameter may be achieved at too high a price. 


4.4.2 Transformation models 


A different kind of reduction is possible for models such as the location model 
which preserve a special structure under a family of transformations. 


Example 4.10. Location model. We return to Example 1.8, the location model.
The likelihood is $\prod g(y_k - \mu)$ which can be rearranged in terms of the order
statistics $y_{(l)}$, i.e., the observed values arranged in increasing order of magnitude.
These in general form the sufficient statistic for the model, minimal except
for the normal distribution. Now the random variables corresponding to the
differences between order statistics, specified, for example, as


$$a = (y_{(2)} - y_{(1)}, \ldots, y_{(n)} - y_{(1)}) = (a_2, \ldots, a_n), \tag{4.24}$$


have a distribution not depending on $\mu$ and hence are ancillary; inference about
$\mu$ is thus based on the conditional distribution of, say, $y_{(1)}$ given $A = a$. The
choice of $y_{(1)}$ is arbitrary because any estimate of location, for example the
mean, is $y_{(1)}$ plus a function of $a$, and the latter is not random after conditioning.
Now the marginal density of $A$ is


$$\int g(y_{(1)} - \mu)\, g(y_{(1)} + a_2 - \mu) \cdots g(y_{(1)} + a_n - \mu)\, dy_{(1)}. \tag{4.25}$$

The integral with respect to $y_{(1)}$ is equivalent to one integrated with respect
to $\mu$ so that, on reexpressing the integrand in terms of the likelihood for the
unordered variables, the density of $A$ is the function necessary to normalize the
likelihood to integrate to one with respect to $\mu$. That is, the conditional density
of $Y_{(1)}$, or of any other measure of location, is determined by normalizing the
likelihood function and regarding it as a function of $y_{(1)}$ for given $a$. This implies
that p-values and confidence limits for $\mu$ result from normalizing the likelihood
and treating it like a distribution of $\mu$.
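
The recipe of normalizing the likelihood can be sketched numerically. The
example assumes, purely for illustration, that $g$ is the standard Cauchy
density; the sample, the grid and the function names are hypothetical, and the
integral over $\mu$ is approximated by a Riemann sum on a finite grid.

```python
from math import pi

def cauchy_density(x):
    """Standard Cauchy density, used here as an example of g."""
    return 1.0 / (pi * (1.0 + x * x))

def make_grid(lo, hi, step):
    return [lo + i * step for i in range(int(round((hi - lo) / step)) + 1)]

def normalized_likelihood(y, grid, g=cauchy_density):
    """Likelihood prod g(y_k - mu) on an equally spaced grid of mu values,
    scaled so that its Riemann sum over the grid is one."""
    lik = []
    for mu in grid:
        value = 1.0
        for yk in y:
            value *= g(yk - mu)
        lik.append(value)
    step = grid[1] - grid[0]
    total = sum(lik) * step
    return [v / total for v in lik]

def limit(grid, density, prob):
    """Smallest grid value at which the accumulated area reaches prob."""
    step = grid[1] - grid[0]
    area = 0.0
    for mu, d in zip(grid, density):
        area += d * step
        if area >= prob:
            return mu
    return grid[-1]

y = [-1.3, 0.2, 0.9, 4.0]                  # hypothetical sample
grid = make_grid(-60.0, 60.0, 0.01)
dens = normalized_likelihood(y, grid)
lower, upper = limit(grid, dens, 0.025), limit(grid, dens, 0.975)
```

Shifting every observation by a constant shifts `lower` and `upper` by the same
constant, which is the location structure exploited in the argument.
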

This is an exceptional situation in frequentist theory. In general, likelihood is 
a point function of its argument and integrating it over sets of parameter values 
is not statistically meaningful. In a Bayesian formulation it would correspond 
to combination with a uniform prior but no notion of a prior distribution is 
involved in the present argument. 

The ancillary statistics allow testing conformity with the model. For example,
to check on the functional form of $g(.)$ it would be best to work with the
differences of the ordered values from the mean, $(y_{(l)} - \bar{y})$ for $l = 1, \ldots, n$.
These are functions of $a$ as previously defined. A test statistic could be set up
informally sensitive to departures in distributional form from those implied by
$g(.)$. Or $(y_{(l)} - \bar{y})$ could be plotted against $G^{-1}\{(l - 1/2)/(n + 1)\}$, where $G(.)$
is the cumulative distribution function corresponding to the density $g(.)$.


From a more formal point of view, there is defined a set, in fact a group, of 
transformations on the sample space, namely the set of location shifts. Associ- 
ated with that is a group on the parameter space. The sample space is thereby 
partitioned into orbits defined by constant values of a, invariant under the 
group operation, and position on the orbit conditional on the orbit in question 
provides the information about $\mu$. This suggests more general formulations of
which the next most complicated is the scale and location model in which the
density is $\tau^{-1} g\{(y - \mu)/\tau\}$. Here the family of implied transformations is
two-dimensional, corresponding to changes of scale and location and the ancillary
statistics are standardized differences of order statistics, i.e.,


$$\{(y_{(3)} - y_{(1)})/(y_{(2)} - y_{(1)}), \ldots, (y_{(n)} - y_{(1)})/(y_{(2)} - y_{(1)})\} = (a_3^*, \ldots, a_n^*). \tag{4.26}$$


There are many essentially equivalent ways of specifying the ancillary 
statistic. 

We can thus write estimates of $\mu$ and $\tau$ as linear combinations of $y_{(1)}$ and
$y_{(2)}$ with coefficients depending on the fixed values of the ancillary
statistics. Further, the Jacobian of the transformation from the $y_{(k)}$ to the variables
$(y_{(1)}, y_{(2)}, a^*)$ is $(y_{(2)} - y_{(1)})^{n-2}$ so that all required distributions are directly
derivable from the likelihood function.

A different and in some ways more widely useful role of transformation 
models is to suggest reduction of a sufficient statistic to focus on a particular 
parameter of interest. 


Example 4.11. Normal mean, variance unknown (ctd). To test the null
hypothesis $\mu = \mu_0$ when the variance $\sigma^2$ is unknown using the statistics $(\bar{Y}, s^2)$ we
need a function that is unaltered by the transformations of origin and units of
measurement encapsulated in


$$(\bar{Y}, s^2) \text{ to } (a + b\bar{Y}, b^2 s^2) \tag{4.27}$$

and

$$\mu \text{ to } a + b\mu. \tag{4.28}$$


This is necessarily a function of the standard Student $t$ statistic.


4.5 Some further Bayesian examples


In principle the prior density in a Bayesian analysis is an insertion of additional
information and the form of the prior should be dictated by the nature of that 
evidence. It is useful, at least for theoretical discussion, to look at priors which 
lead to mathematically tractable answers. One such form, useful usually only 
for nuisance parameters, is to take a distribution with finite support, in particular 
a two-point prior. This has in one dimension three adjustable parameters, the 
position of two points and a probability, and for some limited purposes this 
may be adequate. Because the posterior distribution remains concentrated on 
the same two points computational aspects are much simplified. 

We shall not develop that idea further and turn instead to other examples 
of parametric conjugate priors which exploit the consequences of exponential 
family structure as exemplified in Section 2.4. 


Example 4.12. Normal variance. Suppose that $Y_1, \ldots, Y_n$ are independently
normally distributed with known mean, taken without loss of generality to be
zero, and unknown variance $\sigma^2$. The likelihood is, except for a constant factor,


$$\frac{1}{\sigma^n} \exp\{-\sum y_k^2/(2\sigma^2)\}. \tag{4.29}$$


The canonical parameter is $\phi = 1/\sigma^2$ and simplicity of structure suggests
taking $\phi$ to have a prior gamma density which it is convenient to write in the form


$$\pi(\phi; g, n_\pi) = g(g\phi)^{n_\pi/2 - 1} e^{-g\phi} / \Gamma(n_\pi/2), \tag{4.30}$$


defined by two quantities assumed known. One is $n_\pi$, which plays the role of an
effective sample size attached to the prior density, by analogy with the form of
the chi-squared density with $n$ degrees of freedom. The second defining quantity
is $g$. Transformed into a distribution for $\sigma^2$ it is often called the inverse gamma
distribution. Also $E_\pi(\Phi) = n_\pi/(2g)$.

On multiplying the likelihood by the prior density, the posterior density of
$\Phi$ is proportional to


$$\phi^{(n + n_\pi)/2 - 1} \exp[-\{\sum y_k^2 + n_\pi/E_\pi(\Phi)\}\phi/2]. \tag{4.31}$$

The posterior distribution is in effect found by treating

$$\{\sum y_k^2 + n_\pi/E_\pi(\Phi)\}\Phi \tag{4.32}$$


as having a chi-squared distribution with $n + n_\pi$ degrees of freedom.

Formally, frequentist inference is based on the pivot $\sum Y_k^2/\sigma^2$, the pivotal
distribution being the chi-squared distribution with $n$ degrees of freedom. There
is formal, although of course not conceptual, equivalence between the two
methods when $n_\pi = 0$. This arises from the improper prior $d\phi/\phi$, equivalent
to $d\sigma/\sigma$ or to a uniform improper prior for $\log \sigma$. That is, while there is never
in this setting exact agreement between Bayesian and frequentist solutions, the
latter can be approached as a limit as $n_\pi \to 0$.
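
The conjugate updating in this example can be sketched in a few lines. The data
and prior settings below are hypothetical and the function name is ours; the
function simply reads off the shape and rate of the gamma posterior (4.31).

```python
def posterior_gamma(y, n_prior, prior_mean):
    """Shape and rate of the posterior gamma density (4.31) of Phi = 1/sigma^2.

    n_prior plays the role of n_pi and prior_mean the role of E_pi(Phi).
    """
    n = len(y)
    sum_sq = sum(v * v for v in y)
    shape = (n + n_prior) / 2
    rate = (sum_sq + n_prior / prior_mean) / 2
    return shape, rate

y = [0.8, -1.1, 0.3, 1.9, -0.6]          # hypothetical data with known mean zero
shape, rate = posterior_gamma(y, n_prior=4, prior_mean=1.0)
posterior_mean = shape / rate            # posterior mean of Phi
chi2_df = 2 * shape                      # degrees of freedom attached to (4.32)

# Letting n_prior tend to zero approaches the frequentist pivot with n degrees of freedom.
shape0, rate0 = posterior_gamma(y, n_prior=1e-9, prior_mean=1.0)
limiting_mean = shape0 / rate0
```

Here `2 * rate` times $\Phi$ has the chi-squared distribution with `chi2_df`
degrees of freedom, and `limiting_mean` is close to $n/\sum y_k^2$.
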


Example 4.13. Normal mean, variance unknown (ctd). In Section 1.5 and 
Example 4.12 we have given simple Bayesian posterior distributions for the 
mean of a normal distribution with variance known and for the variance when 
the mean is known. 

Suppose now that both parameters are unknown and that the mean is the
parameter of interest. The likelihood is proportional to


$$\frac{1}{\sigma^n} \exp[-\{n(\bar{y} - \mu)^2 + \sum (y_k - \bar{y})^2\}/(2\sigma^2)]. \tag{4.33}$$


In many ways the most natural approach may seem to be to take the prior
distributions for mean and variance to correspond to independence with the forms
used in the single-parameter analyses. That is, an inverse gamma distribution
for $\sigma^2$ is combined with a normal distribution of mean $m$ and variance $v$ for $\mu$.
We do not give the details.

A second possibility is to modify the above assessment by replacing $v$ by
$b\sigma^2$, where $b$ is a known constant. Mathematical simplification that results from
this stems from $\mu/\sigma^2$ rather than $\mu$ itself being a component of the canonical
parameter. A statistical justification might sometimes be that $\sigma$ establishes the
natural scale for measuring random variability and that the amount of prior
information is best assessed relative to that. Prior independence of mean and
variance has been abandoned.
The product of the likelihood and the prior is then proportional to


$$\phi^{n/2 + n_\pi/2 - 1} e^{-w\phi}, \tag{4.34}$$

where

$$w = \mu^2\{n/2 + 1/(2b)\} - \mu(n\bar{y} + m/b) + g + n\bar{y}^2/2 + \sum (y_k - \bar{y})^2/2 + m^2/(2b). \tag{4.35}$$

To obtain the marginal posterior distribution of $\mu$ we integrate with respect to
$\phi$ and then normalize the resulting function of $\mu$. The result is proportional to
$w^{-n/2 - n_\pi/2}$. Now note that the standard Student $t$ distribution with $d$ degrees
of freedom has density proportional to $(1 + t^2/d)^{-(d+1)/2}$. Comparison with
$w$ shows without detailed calculation that the posterior density of $\mu$ involves
the Student $t$ distribution with degrees of freedom $n + n_\pi - 1$. More detailed
calculation shows that $(\mu - \mu_\pi)/s_\pi$ has a posterior $t$ distribution, where


$$\mu_\pi = (n\bar{y} + m/b)/(n + 1/b) \tag{4.36}$$


is the usual weighted mean and $s_\pi$ is a composite estimate of precision obtained
from $\sum (y_k - \bar{y})^2$, the prior estimate of variance and the contrast $(m - \bar{y})$.
Again the frequentist solution based on the pivot


$$\frac{(\mu - \bar{Y})\sqrt{n}}{\{\sum (Y_k - \bar{Y})^2/(n - 1)\}^{1/2}}, \tag{4.37}$$


the standard Student $t$ statistic with degrees of freedom $n - 1$, is recovered but
only as a limiting case.
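
The quadratic (4.35) and the weighted mean (4.36) can be checked against one
another numerically. The following sketch uses hypothetical data and prior
settings, and the function names are ours; it confirms that the expanded form of
$w$ matches the sum of squares from which it is derived and that $w$ is smallest
at $\mu_\pi$.

```python
def w_expanded(mu, y, m, b, g):
    """The quadratic in mu written out as in (4.35)."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yk - ybar) ** 2 for yk in y)
    return (mu**2 * (n / 2 + 1 / (2 * b)) - mu * (n * ybar + m / b)
            + g + n * ybar**2 / 2 + ss / 2 + m**2 / (2 * b))

def w_direct(mu, y, m, b, g):
    """The same quantity collected from the likelihood (4.33), the gamma
    prior for phi and the normal prior for mu with variance b * sigma^2."""
    n = len(y)
    ybar = sum(y) / n
    ss = sum((yk - ybar) ** 2 for yk in y)
    return n * (ybar - mu) ** 2 / 2 + ss / 2 + g + (mu - m) ** 2 / (2 * b)

def posterior_centre(y, m, b):
    """Weighted mean (4.36)."""
    n = len(y)
    ybar = sum(y) / n
    return (n * ybar + m / b) / (n + 1 / b)

y = [2.1, 3.4, 1.7, 2.9, 3.8]     # hypothetical data
m, b, g = 0.5, 2.0, 1.0           # hypothetical prior settings
centre = posterior_centre(y, m, b)
```

Because $w$ is quadratic in $\mu$ with minimum at `centre`, the posterior density
of $\mu$, a decreasing function of $w$, is symmetric about the weighted mean.
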


Example 4.14. Components of variance. We now consider an illustration of a 
wide class of problems introduced as Example 1.9, in which the structure of 
unexplained variation in the data is built up from more than one component 
random variable. We suppose that the observed random variable $Y_{ks}$ for
$k = 1, \ldots, m$; $s = 1, \ldots, t$ has the form


$$Y_{ks} = \mu + \eta_k + \epsilon_{ks}, \tag{4.38}$$


where $\mu$ is an unknown mean and the $\eta$ and the $\epsilon$ are mutually independent
normally distributed random variables with zero mean and variances, respectively
$\tau_\eta, \tau_\epsilon$, called components of variance.

Suppose that interest focuses on the components of variance. To simplify
some details suppose that $\mu$ is known. In the balanced case studied here, and
only then, there is a reduction to a (2, 2) exponential family with canonical
parameters $(1/\tau_\epsilon, 1/(\tau_\epsilon + t\tau_\eta))$ so that a simple analytical form would emerge
from giving these two parameters independent gamma prior distributions.

While this gives an elegant result, the dependence on the subgroup size t is for 
most purposes likely to be unreasonable. The restriction to balanced problems, 
i.e., those in which the subgroup size t is the same for all k is also limiting, at 
least so far as exact analytic results are concerned. 

A second possibility is to assign the separate variance components
independent inverse gamma priors and a third is to give one variance component, say $\tau_\epsilon$,
an inverse gamma prior and to give the ratio $\tau_\eta/\tau_\epsilon$ an independent prior
distribution. There is no difficulty in principle in evaluating the posterior distribution
of the unknown parameters and any function of them of interest.

An important more practical point is that the relative precision for estimating
the upper variance component $\tau_\eta$ is less, and often much less, than that for
estimating a variance from a group of $m$ observations and having $m - 1$ degrees of
freedom in the standard terminology. Thus the inference will be more sensitive
to the form of the prior density than is typically the case in simpler examples.
The same remark applies even more strongly to complex multilevel models; it 
is doubtful whether such models should be used for a source of variation for 
which the effective replication is less than 10-20. 


Notes 4 


Section 4.2. For a general discussion of nonpersonalistic Bayesian theory and 
of a range of statistical procedures developed from this perspective, see Box 
and Tiao (1973). Box (1980) proposes the use of Bayesian arguments for para- 
metric inference and frequentist arguments for model checking. Bernardo and 
Smith (1994) give an encyclopedic account of Bayesian theory largely from the 
personalistic perspective. See also Lindley (1990). For a brief historical review, 
see Appendix A. 

Bayesian approaches are based on formal requirements for the concept of 
probability. Birnbaum (1962) aimed to develop a concept of statistical evidence 
on the basis of a number of conditions that the concept should satisfy and it 
was argued for a period that this approach also led to essentially Bayesian 
methods. There were difficulties, however, probably in part at least because 
of confusions between the weak and strong likelihood principles and in his 
final paper Birnbaum (1969) considers sensible confidence intervals to be the 
primary mode of inference. Fisher (1956), by contrast, maintains that different 
contexts require different modes of inference, in effect arguing against any 
single unifying theory. 


Section 4.3. The discussion of the uniform distribution follows Welch (1939),
who, however, drew and maintained the conclusion that the unconditional 
distribution is appropriate for inference. The problem of the two measure- 
ment instruments was introduced by Cox (1958c) in an attempt to clarify the 
issues involved. Most treatments of regression and the linear model adopt the 
conditional formulation of Example 4.3, probably wisely without comment. 


Section 4.4. The relation between the Poisson distribution and the multino- 
mial distribution was used as a mathematical device by R. A. Fisher in an early 
discussion of degrees of freedom in the testing of, for example, contingency 
tables. It is widely used in the context of log Poisson regression in the analysis 
of epidemiological data. The discussion of the Mantel-Haenszel and related 
problems follows Cox (1958a, 1958b) and Cox and Snell (1988). The discus- 
sion of transformation models stems from Fisher (1934) and is substantially 
developed by Fraser (1979) and forms the basis of much of his subsequent 
work (Fraser, 2004), notably in approximating other models by models locally 
with transformation structure. 

Completeness of a family of distributions means essentially that there is no 
nontrivial function of the random variable whose expectation is zero for all 
members of the family. The notion is discussed in the more mathematical of 
the books listed in Notes 1. 

In the Neyman–Pearson development the elimination of nuisance parameters
is formalized in the notions of similar tests and regions of Neyman structure 
(Lehmann and Romano, 2004). 

The distinction between the various formulations of 2 × 2 contingency tables
is set out with great clarity by Pearson (1947). Fisher (1935b) introduces the gen- 
eralized hypergeometric distribution with the somewhat enigmatic words: ‘If it 
be admitted that these marginal frequencies by themselves supply no informa- 
tion on the point at issue ...’. Barnard introduced an unconditional procedure 
that from a Neyman–Pearson perspective achieved a test of preassigned
probability under $H_0$ and improved properties under alternatives, but withdrew the
procedure, after seeing a comment by Fisher. The controversy continues. From 
the perspective taken here achieving a preassigned significance level is not an 
objective and the issue is more whether some form of approximate conditioning 
might be preferable. 


Interpretations of uncertainty


Summary. This chapter discusses the nature of probability as it is used to rep- 
resent both variability and uncertainty in the various approaches to statistical 
inference. After some preliminary remarks, the way in which a frequency notion 
of probability can be used to assess uncertainty is reviewed. Then two contrast- 
ing notions of probability as representing degree of belief in an uncertain event 
or hypothesis are examined. 


5.1 General remarks 


We can now consider some issues involved in formulating and comparing the 
different approaches. 

In some respects the Bayesian formulation is the simpler and in other respects 
the more difficult. Once a likelihood and a prior are specified to a reasonable 
approximation all problems are, in principle at least, straightforward. The res- 
ulting posterior distribution can be manipulated in accordance with the ordinary 
laws of probability. The difficulties centre on the concepts underlying the defin- 
ition of the probabilities involved and then on the numerical specification of the 
prior to sufficient accuracy. 

Sometimes, as in certain genetical problems, it is reasonable to think of $\theta$ as
generated by a stochastic mechanism. There is no dispute that the Bayesian 
approach is at least part of a reasonable formulation and solution in such 
situations. In other cases to use the formulation in a literal way we have to 
regard probability as measuring uncertainty in a sense not necessarily directly 
linked to frequencies. We return to this issue later. Another possible justifica- 
tion of some Bayesian methods is that they provide an algorithm for extracting 
from the likelihood some procedures whose fundamental strength comes from 
frequentist considerations. This can be regarded, in particular, as supporting 
a broad class of procedures, known as shrinkage methods, including ridge
regression. 

The emphasis in this book is quite often on the close relation between 
answers possible from different approaches. This does not imply that the differ- 
ent views never conflict. Also the differences of interpretation between different 
numerically similar answers may be conceptually important. 


5.2 Broad roles of probability 


A recurring theme in the discussion so far has concerned the broad distinction 
between the frequentist and the Bayesian formalization and meaning of probab- 
ility. Kolmogorov’s axiomatic formulation of the theory of probability largely 
decoupled the issue of meaning from the mathematical aspects; his axioms were, 
however, firmly rooted in a frequentist view, although towards the end of his 
life he became concerned with a different interpretation based on complexity. 
But in the present context meaning is crucial. 

There are two ways in which probability may be used in statistical discus- 
sions. The first is phenomenological, to describe in mathematical form the 
empirical regularities that characterize systems containing haphazard variation. 
This typically underlies the formulation of a probability model for the data, in 
particular leading to the unknown parameters which are regarded as a focus 
of interest. The probability of an event $\mathcal{E}$ is an idealized limiting proportion of
times in which $\mathcal{E}$ occurs in a large number of repeat observations on the system
under the same conditions. In some situations the notion of a large number of 
repetitions can be reasonably closely realized; in others, as for example with eco- 
nomic time series, the notion is a more abstract construction. In both cases the 
working assumption is that the parameters describe features of the underlying 
data-generating process divorced from special essentially accidental features of 
the data under analysis. 

That first phenomenological notion is concerned with describing variability. 
The second role of probability is in connection with uncertainty and is thus epi- 
stemological. In the frequentist theory we adapt the frequency-based view of 
probability, using it only indirectly to calibrate the notions of confidence inter- 
vals and significance tests. In most applications of the Bayesian view we need 
an extended notion of probability as measuring the uncertainty of $\mathcal{E}$ given $\mathcal{F}$,
where now $\mathcal{E}$, for example, is not necessarily the outcome of a random system,
but may be a hypothesis or indeed any feature which is unknown to the
investigator. In statistical applications $\mathcal{E}$ is typically some statement about the unknown
parameter $\theta$ or more specifically about the parameter of interest $\psi$. The present
discussion is largely confined to such situations. The issue of whether a single
number could usefully encapsulate uncertainty about the correctness of, say, 
the Fundamental Theory underlying particle physics is far outside the scope of 
the present discussion. It could, perhaps, be applied to a more specific question 
such as a prediction of the Fundamental Theory: will the Higgs boson have 
been discovered by 2010? 

One extended notion of probability aims, in particular, to address the point 
that in interpretation of data there are often sources of uncertainty additional 
to those arising from narrow-sense statistical variability. In the frequentist 
approach these aspects, such as possible systematic errors of measurement, 
are addressed qualitatively, usually by formal or informal sensitivity analysis, 
rather than incorporated into a probability assessment. 


5.3 Frequentist interpretation of upper limits 


First we consider the frequentist interpretation of upper limits obtained, for 
example, from a suitable pivot. We take the simplest example, Example 1.1, 
namely the normal mean when the variance is known, but the considerations 
are fairly general. The upper limit 


$$\bar{Y} + k_c^{*}\sigma_0/\sqrt{n}, \tag{5.1}$$

derived here from the probability statement

$$P(\mu < \bar{Y} + k_c^{*}\sigma_0/\sqrt{n}) = 1 - c, \tag{5.2}$$


is a particular instance of a hypothetical long run of statements a proportion 
$1 - c$ of which will be true, always, of course, assuming our model is sound. We
can, at least in principle, make such a statement for each c and thereby generate 
a collection of statements, sometimes called a confidence distribution. There is 
no restriction to a single c, so long as some compatibility requirements hold. 
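
The long-run interpretation of (5.2) can be checked by simulation. The following Python sketch is an illustration only and is not part of the original development; the function names, the number of repetitions and the values $\mu = 3$, $\sigma_0 = 2$, $n = 25$ are assumptions made for the example. It repeatedly draws a sample mean and records how often the true mean falls below the upper limit.

```python
import random
from statistics import NormalDist


def upper_limit(ybar, sigma0, n, c):
    """Upper 1 - c confidence limit for a normal mean with known sigma0."""
    k = NormalDist().inv_cdf(1 - c)   # upper c point of the standard normal
    return ybar + k * sigma0 / n ** 0.5


def coverage(mu, sigma0, n, c, reps=20000, seed=1):
    """Proportion of repetitions in which mu lies below the upper limit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # The mean of n independent N(mu, sigma0^2) values is N(mu, sigma0^2 / n).
        ybar = rng.gauss(mu, sigma0 / n ** 0.5)
        if mu < upper_limit(ybar, sigma0, n, c):
            hits += 1
    return hits / reps


# One statement for each c: the collection forms the confidence distribution.
for c in (0.10, 0.05, 0.01):
    print(c, coverage(mu=3.0, sigma0=2.0, n=25, c=c))
```

Each printed proportion should be close to the corresponding $1 - c$, whatever value of $\mu$ is used, provided the normal model holds.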

Because this has the formal properties of a distribution for $\mu$ it was called
by R. A. Fisher the fiducial distribution and sometimes the fiducial probability 
distribution. A crucial question is whether this distribution can be interpreted 
and manipulated like an ordinary probability distribution. The position is: 


• a single set of limits for $\mu$ from some data can in some respects be
  considered just like a probability statement for $\mu$;

• such probability statements cannot in general be combined or manipulated
  by the laws of probability to evaluate, for example, the chance that $\mu$
  exceeds some given constant, for example zero. This is clearly illegitimate
  in the present context.

That is, as a single statement a $1 - c$ upper limit has the evidential force of
a statement of a unique event within a probability system. But the rules for 
manipulating probabilities in general do not apply. The limits are, of course, 
directly based on probability calculations. 

Nevertheless the treatment of the confidence interval statement about the 
parameter as if it is in some respects like a probability statement contains the 
important insights that, in inference for the normal mean, the unknown para- 
meter is more likely to be near the centre of the interval than near the end-points 
and that, provided the model is reasonably appropriate, if the mean is outside 
the interval it is not likely to be far outside. 

A more emphatic demonstration that the sets of upper limits defined in this 
way do not determine a probability distribution is to show that in general there 
is an inconsistency if such a formal distribution determined in one stage of 
analysis is used as input into a second stage of probability calculation. We shall 
not give details; see Note 5.2. 

The following example illustrates in very simple form the care needed in 
passing from assumptions about the data, given the model, to inference about 
the model, given the data, and in particular the false conclusions that can follow 
from treating such statements as probability statements. 


Example 5.1. Exchange paradox. There are two envelopes, one with $\phi$ euros
and the other with $2\phi$ euros. One is chosen at random and given to you and
when opened it is found to contain $10^3$ euros. You now have a choice. You
may keep the $10^3$ euros or you may open the other envelope in which case you
keep its contents. You argue that the other envelope is equally likely to contain
500 euros or $2 \times 10^3$ euros and that, provided utility is linear in money, the
expected utility of the new envelope is $1.25 \times 10^3$ euros. Therefore you take
the new envelope. This decision does not depend on the particular value $10^3$ so
that there was no need to open the first envelope.

The conclusion is clearly wrong. The error stems from attaching a probability 
in a non-Bayesian context to the content of the new and as yet unobserved 
envelope, given the observation; the only probability initially defined is that of 
the possible observation, given the parameter, i.e., given the content of both 
envelopes. Both possible values for the content of the new envelope assign 
the same likelihood to the data and are in a sense equally consistent with the 
data. In a non-Bayesian setting, this is not the same as the data implying equal 
probabilities to the two possible values. In a Bayesian discussion with a proper 
prior distribution, the relative prior probabilities attached to 500 euros and 2000 
euros would be crucial. The improper prior that attaches equal prior probability 
to all values of $\phi$ is trapped by the same paradox as that first formulated.
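
The role of the prior in the Bayesian version of the exchange paradox can be made concrete. The Python sketch below is illustrative only: the two priors and the function names are assumptions introduced for the example, not part of the original argument. Given a proper prior on the smaller amount and an observed content of 1000 euros, it computes exactly the probability that the other envelope holds 2000 euros and the expected gain from switching.

```python
from fractions import Fraction


def prob_other_is_double(prior, observed):
    """P(other envelope holds 2 * observed | observed) under a proper prior.

    prior maps each possible smaller amount phi to its prior probability.
    The opened envelope holds phi or 2 * phi with probability 1/2 each.
    The observed amount must be possible under the prior.
    """
    # Case 1: the observed amount is the smaller one, so phi = observed.
    w_double = prior.get(observed, 0) * Fraction(1, 2)
    # Case 2: the observed amount is the larger one, so phi = observed / 2.
    w_half = prior.get(Fraction(observed, 2), 0) * Fraction(1, 2)
    return w_double / (w_double + w_half)


def expected_gain_from_switching(prior, observed):
    """Expected content of the other envelope minus the observed amount."""
    p = prob_other_is_double(prior, observed)
    return p * 2 * observed + (1 - p) * Fraction(observed, 2) - observed


# Two hypothetical proper priors on phi, the smaller amount.
even_prior = {500: Fraction(1, 2), 1000: Fraction(1, 2)}
skewed_prior = {500: Fraction(4, 5), 1000: Fraction(1, 5)}

for name, prior in (("even", even_prior), ("skewed", skewed_prior)):
    print(name, prob_other_is_double(prior, 1000),
          expected_gain_from_switching(prior, 1000))
```

Under the first prior the probability is 1/2 and switching gains 250 euros in expectation; under the second the probability is 1/5 and switching loses 200 euros. The likelihoods are the same in both cases; only the prior differs.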


5.4 Neyman–Pearson operational criteria


We now outline a rather different approach to frequentist intervals not via an 
initial assumption that the dependence on the data is dictated by consideration of 
sufficiency. Suppose that we wish to find, for example, upper limits for $\mu$ with
the appropriate frequency interpretation, i.e., derived from a property such as


$$P(\mu < \bar{Y} + k_c^{*}\sigma_0/\sqrt{n}) = 1 - c. \tag{5.3}$$

Initially we may require that exactly (or to an adequate approximation) for all $\mu$

$$P\{\mu < T(Y; c)\} = 1 - c, \tag{5.4}$$


where $T(Y; c)$ is a function of the observed random variable $Y$. This ensures
that the right coverage probability is achieved. There are often many ways of 
achieving this, some in a sense more efficient than others. It is appealing to 
define optimality, i.e., sensitivity of the analysis, by requiring 


$$P\{\mu' < T(Y; c)\} \tag{5.5}$$


to be minimal for all $\mu' > \mu$ subject to correct coverage. This expresses the
notion that the upper bound should be as small as possible and does so in a
way that is essentially unchanged by monotonic increasing transformations of
$\mu$ and $T$.

Essentially this strong optimality requirement is satisfied only when a simple 
Fisherian reduction is possible. In the Neyman–Pearson approach dependence
on the sufficient statistic, in the specific example $\bar{Y}$, emerges out of the optimality
requirements, rather than being taken as a starting point justified by the defining 
property of sufficient statistics. 


5.5 Some general aspects of the frequentist approach 


The approach via direct study of hypothetical long-run frequency properties has 
the considerable advantage that it provides a way of comparing procedures that 
may have compensating properties of robustness, simplicity and directness and 
of considering the behaviour of procedures when the underlying assumptions 
are not satisfied. 

There remains an important issue, however. Is it clear in each case what is the 
relevant long run of repetitions appropriate for the interpretation of the specific 
data under analysis? 


Example 5.2. Two measuring instruments (ctd). The observed random vari- 
ables are (Y,D), where Y represents the measurement and D is an indicator 
of which instrument is used, i.e., we know which variance applies to our
observation. 

The minimal sufficient statistic is (Y, D) with an obvious generalization if 
repeat observations are made on the same apparatus. We have a (2, 1) exponen- 
tial family in which, however, one component of the statistic, the indicator D, 
has only two points of support. By the Fisherian reduction we condition on the 
observed and exactly known variance, i.e., we use the normal distribution we 
know was determined by the instrument actually used and we take no account 
of the fact that in repetitions we might have obtained a quite different precision. 

Unfortunately the Neyman–Pearson approach does not yield this result automatically;
it conflicts, superficially at least, with the sensitivity requirement for
optimality. That is, if we require a set of sample values leading to the long-run 
inclusion of the true value with a specified probability and having maximum 
probability of exclusion of (nearly all) false values, the conditional procedure 
is not obtained. 

We need a supplementary principle to define the appropriate ensemble of 
repetitions that determines the statistical procedure. Note that this is necessary 
even if the repetitive process were realized physically, for example if the above 
measuring procedure took place every day over a long period. 

This example can be regarded as a pared-down version of Example 4.3, 
concerning regression analysis. The use of D to define a conditional distribution 
is at first sight conceptually different from that used to obtain inference free of 
a nuisance parameter. 
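
The need to choose the ensemble of repetitions can be seen numerically. The Python sketch below is an illustration under stated assumptions: the two standard deviations (1 and 10), the equal chance of each instrument and the use of two-sided intervals are all hypothetical choices, and the unconditional interval used is simply one calibrated to have 95 per cent coverage on average, not the Neyman–Pearson optimum. It compares coverage separately for each instrument.

```python
import random
from statistics import NormalDist

Z = NormalDist()
SIGMAS = (1.0, 10.0)   # hypothetical standard deviations of the two instruments


def marginal_half_width(level=0.95):
    """Half-width w giving coverage `level` averaged over the two instruments."""
    def avg_coverage(w):
        return sum(2 * Z.cdf(w / s) - 1 for s in SIGMAS) / len(SIGMAS)

    lo, hi = 0.0, 100.0
    for _ in range(100):              # bisection; avg_coverage increases with w
        mid = (lo + hi) / 2
        if avg_coverage(mid) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def coverage_by_instrument(half_width_for, mu=0.0, reps=40000, seed=2):
    """Coverage of y +/- half_width_for(d), separately for each instrument d."""
    rng = random.Random(seed)
    hits, counts = [0, 0], [0, 0]
    for _ in range(reps):
        d = rng.randrange(2)          # instrument chosen at random
        y = rng.gauss(mu, SIGMAS[d])
        counts[d] += 1
        hits[d] += abs(y - mu) <= half_width_for(d)
    return [h / c for h, c in zip(hits, counts)]


k = Z.inv_cdf(0.975)
w = marginal_half_width()
conditional = coverage_by_instrument(lambda d: k * SIGMAS[d])
unconditional = coverage_by_instrument(lambda d: w)
print("conditional:  ", conditional)
print("unconditional:", unconditional)
```

The conditional intervals cover at about 95 per cent for whichever instrument was used. The single-width interval also has 95 per cent coverage in the long run, but only as an average of nearly 100 per cent for the precise instrument and about 90 per cent for the imprecise one, so its nominal level does not describe the measurement actually made.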


5.6 Yet more on the frequentist approach 


There are two somewhat different ways of considering frequentist procedures. 
The one in some sense closest to the measurement of strength of evidence is the 
calculation of p-values as a measuring instrument. This is to be assessed, as are 
other measuring devices, by calibration against its performance when used. The 
calibration is the somewhat hypothetical one that a given p has the interpretation 
given previously in Section 3.2. That is, were we to regard a particular p as just 
decisive against the null hypothesis, then in only a long-run proportion p of 
cases in which the null hypothesis is true would it be wrongly rejected. In this 
view such an interpretation serves only to explain the meaning of a particular p. 
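
This hypothetical calibration can be checked by simulation. The Python sketch below is illustrative only; it assumes a one-sided test of a normal mean with known variance, and the function names, sample size and number of repetitions are choices made for the example. Data are generated with the null hypothesis true and the proportion of p-values at or below a given threshold is recorded.

```python
import random
from statistics import NormalDist


def p_value(ybar, sigma0, n, mu0=0.0):
    """One-sided p-value for H0: mu = mu0 against larger means."""
    z = (ybar - mu0) / (sigma0 / n ** 0.5)
    return 1 - NormalDist().cdf(z)


def rejection_rate(threshold, sigma0=1.0, n=10, reps=20000, seed=3):
    """Proportion of null data sets with p-value at or below the threshold."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        ybar = rng.gauss(0.0, sigma0 / n ** 0.5)   # sample mean under H0
        count += p_value(ybar, sigma0, n) <= threshold
    return count / reps


for threshold in (0.10, 0.05, 0.01):
    print(threshold, rejection_rate(threshold))
```

Each proportion should be close to the threshold itself, which is the sense in which a particular p is calibrated.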

A more direct view is to regard a significance test at some preassigned 
level, for example 0.05, as defining a procedure which, if used as a basis for 
interpretation, controls a certain long-run error rate. Then, in principle at least, a 
critical threshold value of p should be chosen in the context of each application 
and used as indicated. This would specify what is sometimes called a rule of
inductive behaviour. If taken very literally no particular statement of a probab- 
ilistic nature is possible about any specific case. In its extreme form, we report 
a departure from the null hypothesis if and only if p is less than or equal to the 
preassigned level. While procedures are rarely used in quite such a mechanical 
form, such a procedure is at least a crude model of reporting practice in some 
areas of science and, while subject to obvious criticisms and improvement, 
does give some protection against the overemphasizing of results of transit- 
ory interest. The procedure is also a model of the operation of some kinds of 
regulatory agency. 

We shall not emphasize that approach here but concentrate on the reporting 
of p-values as an analytical and interpretational device. In particular, this means 
that any probabilities that are calculated are to be relevant to what are sometimes 
referred to as the unique set of data under analysis. 


Example 5.3. Rainy days in Gothenburg. Consider daily rainfall measured at 
some defined point in Gothenburg, say in April. To be specific let W be the 
event that on a given day the rainfall exceeds 5 mm. Ignore climate change and 
any other secular changes and suppose we have a large amount of historical data 
recording the occurrence and nonoccurrence of W on April days. For simplicity, 
ignore possible dependence between nearby days. The relative frequency of W 
when we aggregate will tend to stabilize and we idealize this to a limiting value 
$\pi_W$, the probability of a wet day. This is frequency-based, a physical measure
of the weather-system and what is sometimes called a chance to distinguish it 
from probability as assessing uncertainty. 

Now consider the question: will it be wet in Gothenburg tomorrow, a very 
specific day in April? 

Suppose we are interested in this unique day, not in a sequence of predictions. 
If probability is a degree of belief, then there are arguments that in the absence 
of other evidence the value $\pi_W$ should apply or if there are other considerations,
then they have to be combined with $\pi_W$.

But supposing that we stick to a frequentist approach. There are then two 
lines of argument. The first, essentially that of a rule of inductive behaviour, is 
that probability is inapplicable (at least until tomorrow midnight by when the 
probability will be either 0 or 1). We may follow the rule of saying that ‘It will 
rain tomorrow’. If we follow such a rule repeatedly we will be right a proportion 
$\pi_W$ of times but no further statement about tomorrow is possible.

Another approach is to say that the probability for tomorrow is $\pi_W$ but
only if two conditions are satisfied. The first is that tomorrow is a member
of an ensemble of repetitions in a proportion $\pi_W$ of which the event occurs.
The second is that one cannot establish tomorrow as a member of a sub-ensemble
with a different proportion, that tomorrow is not a member of a recognizable subset.
That is, the statement must be adequately conditional. There are substantial
difficulties in implementing this notion precisely in a mathematical formaliz- 
ation and these correspond to serious difficulties in statistical analysis. If we 
condition too far, every event is unique. Nevertheless the notion of appropriate 
conditioning captures an essential aspect of frequency-based inference. 
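
The idea of a recognizable subset can be illustrated by a small simulation. The Python sketch below rests entirely on invented numbers: an observable feature, labelled here low pressure, is assumed to occur on a quarter of days, with wet days having probability 0.6 when it is present and 0.2 otherwise. None of these values or names comes from the example itself.

```python
import random


def simulate_days(days=40000, seed=6):
    """Return (overall wet frequency, wet frequency on low-pressure days)."""
    rng = random.Random(seed)
    wet_all = wet_low = n_low = 0
    for _ in range(days):
        low = rng.random() < 0.25                   # hypothetical observable feature
        wet = rng.random() < (0.6 if low else 0.2)  # hypothetical chances of a wet day
        wet_all += wet
        if low:
            n_low += 1
            wet_low += wet
    return wet_all / days, wet_low / n_low


overall, on_low_pressure_days = simulate_days()
print(overall, on_low_pressure_days)
```

The overall frequency settles near 0.30, but if tomorrow can be recognized as a low-pressure day the relevant proportion is near 0.60, and the overall value is then not adequately conditional.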


Example 5.4. The normal mean (ctd). We now return to the study of limits for
the mean $\mu$ of a normal distribution with known variance; this case is taken
purely for simplicity and the argument is really very general. In the Bayesian
discussion we derive a distribution for $\mu$ which is, when the prior is "flat",
normal with mean $\bar{y}$ and variance $\sigma_0^2/n$. In the frequentist approach we start
from the statement


$$P(\mu < \bar{Y} + k_c^{*}\sigma_0/\sqrt{n}) = 1 - c. \tag{5.6}$$


Then we take our data with mean $\bar{y}$ and substitute into the previous equation
to obtain a limit for $\mu$, namely $\bar{y} + k_c^{*}\sigma_0/\sqrt{n}$. Following the discussion of
Example 5.3 we have two interpretations. The first is that probability does not 
apply, only the properties of the rule of inductive behaviour. The second is that 
probability does apply, provided there is no further conditioning set available 
that would lead to a (substantially) different probability. Unfortunately, while 
at some intuitive level it is clear that, in the absence of further information, no 
useful further conditioning set is available, careful mathematical discussion of 
the point is delicate and the conclusions less clear than one might have hoped; 
see Note 5.6. 


5.7 Personalistic probability 


We now turn to approaches that attempt to measure uncertainty directly by 
probability and which therefore potentially form a basis for a general Bayesian 
discussion. 

A view of probability that has been strongly advocated as a systematic base 
for statistical theory and analysis is that P(E | F) represents the degree of belief 
in E, given F, held by a particular individual, commonly referred to as You. 
It is sometimes convenient to omit the conditioning information F although in 
applications it is always important and relevant and includes assumptions, for 
example, about the data-generating process which You consider sensible. 

While many treatments of this topic regard the frequency interpretation of
probability merely as the limiting case of Your degree of belief when there 
is a large amount of appropriate data, it is in some ways simpler in defining 
P(E) to suppose available in principle for any p such that 0 < p < 1 an agreed 
system of generating events with long-run frequency of occurrence p, or at least 
generally agreed to have probability p. These might, for example, be based on 
a well-tested random number generator. 

You then consider a set of hypothetical situations in which You must choose 
between 


• a valuable prize if E is true and zero otherwise, or
• that same prize if one outcome of the random system with probability p
  occurs and zero otherwise.


The value of p at which You are indifferent between these choices is defined 
to be P(E); such a p exists under weak conditions. The argument implies the 
existence of P(E) in principle for all possible events E. The definition is not
restricted to events in replicated random systems. 

In statistical arguments the idea is in principle applied for each value $\theta^{*}$
of the parameter to evaluate the probability that the parameter $\Theta$, considered
now as random, is less than $\theta^{*}$, i.e., in principle to build the prior distribution
of $\Theta$.

To show that probability as so defined has the usual mathematical properties 
of probability, arguments can be produced to show that behaviour in different 
but related situations that was inconsistent with the standard laws of probability 
would be equivalent to making loss-making choices in the kind of situation 
outlined above. Thus the discussion is not necessarily about the behaviour of 
real people but rather is about the internal consistency of Your probability 
assessments, so-called coherency. In fact experiments show that coherency is 
often not achieved, even in quite simple experimental settings. 
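
The simplest instance of such an argument concerns an event and its complement. The Python sketch below is a standard textbook illustration and is not claimed to be the particular argument referred to above; the function name and the numerical values are assumptions. If Your probabilities for E and for not-E do not sum to one, an opponent trading two unit bets with You at Your own prices secures a certain gain.

```python
def sure_loss(p_event, p_complement, stake=1.0):
    """Guaranteed loss from trading bets on E and on not-E at Your own prices.

    Exactly one of the two bets pays `stake`. If the prices sum to more than
    one You are sold both bets; if to less than one You are made to sell both.
    Either way the loss is the gap between the total price and the payoff.
    """
    return abs(p_event + p_complement - 1.0) * stake


print(sure_loss(0.6, 0.6))   # incoherent: the prices sum to 1.2
print(sure_loss(0.3, 0.7))   # coherent: no guaranteed loss
```

Only assessments satisfying $P(E) + P(\text{not } E) = 1$ avoid the certain loss, which is one of the usual laws of probability.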

Because Your assessments of uncertainty obey the laws of probability, they 
satisfy Bayes’ theorem. Hence it is argued that the only coherent route to stat- 
istical assessment of Your uncertainty given the data y must be consistent with 
the application of Bayes’ theorem to calculate the posterior distribution of the 
parameter of interest from the prior distribution and the likelihood given by the 
model. 

In this approach, it is required that participation in the choices outlined above 
is compulsory; there is no concept of being ignorant about E and of refusing to
take part in the assessment. 

Many treatments of this type of probability tie it strongly to the decisions 
You have to make on the basis of the evidence under consideration. 


5.8 Impersonal degree of belief


It is explicit in the treatment of the previous section that the probabilities con- 
cerned belong to a particular person, You, and there is no suggestion that even 
given the same information F different people will have the same probability. 
Any agreement between individuals is presumed to come from the availability 
of a large body of data with an agreed probability model, when the contribu- 
tion of the prior will often be minor, as indeed we have seen in a number of 
examples. 

A conceptually quite different approach is to define P(E | F) as the degree 
of belief in E that a reasonable person would hold given F. The presumption 
then is that differences between reasonable individuals are to be considered 
as arising from their differing information bases. The term objective degree of 
belief may be used for such a notion. 

The probability scale can be calibrated against a standard set of frequency- 
based chances. Arguments can again be produced as to why this form of 
probability obeys the usual rules of probability theory. 

To be useful in individual applications, specific values have to be assigned to 
the probabilities and in many applications this is done by using a flat prior which 
is intended to represent an initial state of ignorance, leaving the final analysis to 
be essentially a summary of information provided by the data. Example 1.1, the 
normal mean, provides an instance where a very dispersed prior in the form of 
a normal distribution with very large variance v provides in the limit a Bayesian
solution identical with the confidence interval form. 
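
The limiting agreement can be verified directly from the standard conjugate normal calculation. In the Python sketch below the prior mean, the data summary and the function names are assumptions made for illustration; the prior variance is the quantity written v above.

```python
from statistics import NormalDist


def posterior_upper_limit(ybar, sigma0, n, c, prior_mean, prior_var):
    """Upper 1 - c posterior limit for mu under a normal prior."""
    precision = n / sigma0 ** 2 + 1 / prior_var     # posterior precision
    mean = (n * ybar / sigma0 ** 2 + prior_mean / prior_var) / precision
    return mean + NormalDist().inv_cdf(1 - c) * precision ** -0.5


def confidence_upper_limit(ybar, sigma0, n, c):
    """Frequentist upper 1 - c limit for mu with known sigma0."""
    return ybar + NormalDist().inv_cdf(1 - c) * sigma0 / n ** 0.5


for v in (1.0, 100.0, 1e6):
    print(v, posterior_upper_limit(1.2, 2.0, 16, 0.05, 0.0, v),
          confidence_upper_limit(1.2, 2.0, 16, 0.05))
```

As v increases the posterior limit approaches the confidence limit; for small v the prior pulls the limit towards the prior mean.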

There are, however, some difficulties with this. 


• Even for a scalar parameter $\theta$ the flat prior is not invariant under
  reparameterization. Thus if $\theta$ is uniform $e^{\theta}$ has an improper exponential
  distribution, which is far from flat.

• In a specific instance it may be hard to justify a distribution putting much
  more weight outside any finite interval than it does inside as a
  representation of ignorance or indifference.

• In multidimensional parameters the difficulties of specifying a suitable prior
  are much more acute.


For a scalar parameter the first point can be addressed by finding a form of the 
parameter closest to being a location parameter. One way of doing this will be 
discussed in Example 6.3. 

The difficulty with multiparameter priors can be seen from the following 
example. 


Example 5.5. The noncentral chi-squared distribution. Let $(Y_1, \ldots, Y_n)$ be
independently normally distributed with unit variance and means $\mu_1, \ldots, \mu_n$
referring to independent situations and therefore with independent priors,
assumed flat. Each $\mu_k^2$ has posterior expectation $y_k^2 + 1$. Then if interest focuses
on $\Delta^2 = \sum \mu_k^2$, it has posterior expectation $D^2 + n$. In fact its posterior
distribution is noncentral chi-squared with $n$ degrees of freedom and noncentrality
$D^2 = \sum Y_k^2$. This implies that, for large $n$, $\Delta^2$ is with high probability
$D^2 + n + O_p(\sqrt{n})$. But this is absurd in that whatever the true value of $\Delta^2$,
the statistic $D^2$ is with high probability $\Delta^2 + n + O_p(\sqrt{n})$. A very flat prior in
one dimension gives good results from almost all viewpoints, whereas a very
flat prior and independence in many dimensions do not. This is called Stein’s
paradox or more accurately one of Stein’s paradoxes.

If it were agreed that only the statistic D? and the parameter A? are relevant 
the problem could be collapsed into a one-dimensional one. Such a reduction 
is, in general, not available in multiparameter problems and even in this one a 
general Bayesian solution is not of this reduced form. 
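
The size of the discrepancy is easily seen by simulation. The Python sketch below is illustrative only; the number of means, their common true value of 1 and the function name are assumptions. It compares the true sum of squared means with the sum of squared observations, with the flat-prior posterior expectation (sum of squares plus n) and with the sum of squares minus n suggested by the sampling behaviour of the statistic.

```python
import random


def stein_demo(n=2000, mu_value=1.0, seed=4):
    """Return the true sum of squared means and three quantities based on data."""
    rng = random.Random(seed)
    mus = [mu_value] * n                       # hypothetical true means
    ys = [rng.gauss(mu, 1.0) for mu in mus]    # unit-variance observations
    delta2 = sum(mu * mu for mu in mus)        # true sum of squared means
    d2 = sum(y * y for y in ys)                # sum of squared observations
    posterior_mean = d2 + n                    # flat-prior posterior expectation
    corrected = d2 - n                         # removes the sampling inflation of d2
    return delta2, d2, posterior_mean, corrected


delta2, d2, posterior_mean, corrected = stein_demo()
print(delta2, round(d2), round(posterior_mean), round(corrected))
```

With these settings the true value is 2000, the statistic is near 4000, the posterior expectation is near 6000 and the corrected value is near 2000. The posterior is thus centred about 2n away from the truth, an error far larger than the random variation, which is of order $\sqrt{n}$.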


Quite generally a prior that gives results that are reasonable from various 
viewpoints for a single parameter will have unappealing features if applied 
independently to many parameters. The following example could be phrased 
more generally, for example for exponential family distributions, but is given 
now for binomial probabilities. 


Example 5.6. A set of binomial probabilities. Let $\pi_1, \ldots, \pi_n$ be separate binomial
probabilities of success, referring, for example, to properties of distinct
parts of some random system. For example, success and failure may refer
respectively to the functioning or failure of a component. Suppose that to estimate
each probability $m$ independent trials are made with $r_k$ successes, trials for
different events being independent. If each $\pi_k$ is assumed to have a uniform
prior on (0, 1), then the posterior distribution of $\pi_k$ is the beta distribution

$$\frac{\pi_k^{r_k}(1 - \pi_k)^{m - r_k}}{B(r_k + 1, m - r_k + 1)}, \tag{5.7}$$

where the beta function in the denominator is a normalizing constant. It follows
that the posterior mean of $\pi_k$ is $(r_k + 1)/(m + 2)$. Now suppose that interest
lies in some function of the $\pi_k$, such as in the reliability context $\psi_n = \prod \pi_k$.
Because of the assumed independencies, the posterior distribution of $\psi_n$ is
derived from a product of beta-distributed random variables and hence is, for
large $m$, close to a log normal distribution. Further, the mean of the posterior
distribution is, by independence,

$$\prod (r_k + 1)/(m + 2)^n \tag{5.8}$$

and as $n \to \infty$ this, normalized by $\psi_n$, is

$$\frac{\prod \{1 + 1/(\pi_k m)\}}{(1 + 2/m)^n}. \tag{5.9}$$


Now especially if $n$ is large compared with $m$ this ratio is, in general, very
different from 1. Indeed if all the $\pi_k$ are small the ratio is greater than 1 and if
all the $\pi_k$ are near 1 the ratio is less than 1. This is clear on general grounds
in that the probabilities encountered are systematically discrepant from the 
implications of the prior distribution. 
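
The behaviour of the limiting ratio is easily tabulated. In the Python sketch below each success count is replaced by its expectation, the number of trials times the success probability, so that the posterior mean of each probability becomes that probability multiplied by a factor depending only on it and on the number of trials. The function name and the numerical settings are assumptions made for illustration.

```python
def posterior_mean_ratio(pis, m):
    """Posterior mean of the product of probabilities divided by the true product.

    Uniform priors are assumed for every probability and each success count
    is set to its expected value m * pi, as in the limiting argument.
    """
    ratio = 1.0
    for pi in pis:
        ratio *= (1 + 1 / (pi * m)) / (1 + 2 / m)
    return ratio


m = 20
for pi in (0.1, 0.5, 0.9):
    print(pi, posterior_mean_ratio([pi] * 50, m))
```

With fifty components and twenty trials each, the ratio is exactly 1 when every probability is one half, far above 1 when every probability is 0.1 and well below 1 when every probability is 0.9, in line with the statement above.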

This use of prior distributions to insert information additional to and distinct 
from that supplied by the data has to be sharply contrasted with an empirical 
Bayes approach in which the prior density is chosen to match the data and 
hence in effect to smooth the empirical distributions encountered. For this a 
simple approach is to assume a conjugate prior, in this case a beta density 
proportional to $\pi^{\lambda_1 - 1}(1 - \pi)^{\lambda_2 - 1}$ and having two unknown parameters. The
marginal likelihood of the data, i.e., that obtained by integrating out $\pi$, can
thus be obtained and the $\lambda$s estimated by frequentist methods, such as those of
Chapter 6. If errors in estimating the $\lambda$s are ignored the application of Bayes’
theorem to find the posterior distribution of any function of the $\pi_k$, such as $\psi_n$,
raises no special problems. To make this into a fully Bayesian solution, it is
necessary only to adopt a prior distribution on the $\lambda$s; its form is unlikely to be
critical. 
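
A minimal sketch of the empirical Bayes calculation follows, in Python. It is an illustration under stated assumptions and not a recommended implementation: the two parameters of the beta prior are estimated by a crude grid search over the beta-binomial marginal likelihood rather than by the methods of Chapter 6, the data are simulated from a hypothetical beta prior with parameters 8 and 2, and all function names are invented for the example.

```python
import math
import random
from collections import Counter


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def marginal_loglik(counts, m, lam1, lam2):
    """Beta-binomial log likelihood with the success probabilities integrated out.

    counts maps each observed number of successes r to the number of
    components showing that value.
    """
    total = 0.0
    for r, freq in counts.items():
        log_choose = math.log(math.comb(m, r))
        total += freq * (log_choose + log_beta(r + lam1, m - r + lam2)
                         - log_beta(lam1, lam2))
    return total


def fit_beta_prior(rs, m, grid=None):
    """Crude maximum likelihood for (lam1, lam2) by grid search."""
    grid = grid or [0.5 * j for j in range(1, 41)]     # 0.5, 1.0, ..., 20.0
    counts = Counter(rs)
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: marginal_loglik(counts, m, ab[0], ab[1]))


def posterior_mean(r, m, lam1, lam2):
    """Posterior mean of a success probability under the fitted beta prior."""
    return (r + lam1) / (m + lam1 + lam2)


rng = random.Random(5)
m = 20
true_pis = [rng.betavariate(8, 2) for _ in range(200)]   # hypothetical components
rs = [sum(rng.random() < p for _ in range(m)) for p in true_pis]
lam1, lam2 = fit_beta_prior(rs, m)
print(lam1, lam2, posterior_mean(rs[0], m, lam1, lam2))
```

The fitted prior is centred near the average success rate in the data, so that the individual posterior means are shrunk towards the values actually encountered rather than towards 1/2 as under the uniform prior.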


The difficulties with flat and supposedly innocuous priors are most striking 
when the number of component parameters is large but are not confined to this 
situation. 


Example 5.7. Exponential regression. Suppose that the exponential regression 
of Example 1.5 is rewritten in the form 


$$E(Y_k) = \alpha + \beta\rho^{z_k}, \tag{5.10}$$


i.e., by writing $\rho = e^{\gamma}$, and suppose that it is known that $0 < \rho < 1$. Suppose
further that $\alpha$, $\beta$, $\rho$ are given independent uniform prior densities, the last over
(0, 1) and that the unknown standard deviation $\sigma$ has a prior proportional to
$1/\sigma$; thus three of the parameters have improper priors to be regarded as limiting
forms.

Suppose further that $n$ independent observations are taken at values
$z_k = z_0 + ak$, where $z_0, a > 0$. Then it can be shown that the posterior density of $\rho$
tends to concentrate near 0 or 1, corresponding in effect to a model in which 
E(Y) is constant. 


5.9 Reference priors


While a notion of probability as measuring objective degree of belief has much 
intuitive appeal, it is unclear how to supply even approximate numerical values 
representing belief about sources of knowledge external to the data under ana- 
lysis. More specifically to achieve goals similar to those addressed in frequentist 
theory it is required that the prior represents in some sense indifference or lack 
of information about relevant parameters. There are, however, considerable dif- 
ficulties centring around the issue of defining a prior distribution representing 
a state of initial ignorance about a parameter. Even in the simplest situation of 
there being only two possible true distributions, it is not clear that saying that 
each has prior probability 1/2 is really a statement of no knowledge and the 
situation gets more severe when one contemplates infinite parameter spaces. 

An alternative approach is to define a reference prior as a standard in which 
the contribution of the data to the resulting posterior is maximized without 
attempting to interpret the reference prior as such as a probability distribution. 

A sketch of the steps needed to do this is as follows. First for any random 
variable W with known density p(w) the entropy H(W) of a single observation 
on W is defined by 


$$H(W) = -E\{\log p(W)\} = -\int p(w) \log p(w)\,dw. \tag{5.11}$$

This is best considered initially at least in the discrete form


$$-\sum p_k \log p_k, \tag{5.12}$$


where the sum is over the points of support of W. The general idea is that 
observing a random variable whose value is essentially known is uninformative 
or unsurprising; the entropy is zero if there is unit probability at a single point, 
whereas if the distribution is widely dispersed over a large number of individu- 
ally small probabilities the entropy is high. In the case of random variables, all 
discrete, representing $Y$ and $\Theta$ we may compare the entropies of posterior and
prior by


$$E_Y\{H(\Theta \mid Y)\} - H(\Theta), \tag{5.13}$$


where the second term is the entropy of the prior distribution. In principle, the 
least informative prior is the one that maximizes this difference, i.e., which 
emphasizes the contribution of the data as contrasted with the prior. 

Now suppose that the whole vector Y is independently replicated n times to 
produce a new random variable $Y^{*n}$. With the discrete parameter case contemplated,
under mild conditions, and in particular if the parameter space is finite,
the posterior distribution will become concentrated on a single point and the
first term of (5.13) will tend to zero and so the required prior has maximum
entropy. That is, the prior maximizes (5.12) and for finite parameter spaces this 
attaches equal prior probability to all points. 
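
The properties of the discrete entropy (5.12) used here can be checked numerically. The Python sketch below is an illustration; the function names and the choice of five support points are assumptions. It evaluates the entropy of a point mass and of the uniform distribution and compares the latter with randomly generated distributions on the same points.

```python
import math
import random


def entropy(probs):
    """Discrete entropy, minus the sum of p log p, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def random_distribution(k, rng):
    """A haphazardly chosen probability distribution on k points."""
    weights = [rng.random() for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(7)
uniform = [0.2] * 5
print(entropy([1.0, 0.0, 0.0, 0.0, 0.0]))   # point mass: zero entropy
print(entropy(uniform), math.log(5))        # uniform: entropy log 5
print(max(entropy(random_distribution(5, rng)) for _ in range(1000)))
```

No randomly generated distribution exceeds the entropy log 5 of the uniform distribution, consistent with equal prior probabilities maximizing (5.12) on a finite parameter space.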

There are two difficulties with this as a general argument. One is that if the 
parameter space is infinite $H(\Theta)$ can be made arbitrarily large. Also for
continuous random variables the definition (5.11) is not invariant under monotonic
transformations. To encapsulate the general idea of maximizing the contribution 
of the data in this more general setting requires a more elaborate formulation 
involving a careful specification of limiting operations which will not be given 
here. The argument does depend on limiting replication from $Y$ to $Y^{*n}$ as $n$
increases; this is not the same as increasing the dimension of the observed 
random variable but rather is one of replication of the whole data. 

Two explicit conclusions from this development are firstly that for a scale and 
location model, with location parameter $\mu$ and scale parameter $\sigma$, the reference
prior has the improper form $d\mu\,d\sigma/\sigma$, i.e., it treats $\mu$ and $\log \sigma$ as having
independent highly dispersed and effectively uniform priors. Secondly, for a single
parameter $\theta$ the reference prior is essentially uniform for a transformation of the
parameter that makes the model close to a location parameter. The calculation 
of this forms Example 6.3. 

There are, however, difficulties with this approach, in particular in that the 
prior distribution for a particular parameter may well depend on which is the 
parameter of primary interest or even on the order in which a set of nuisance 
parameters is considered. Further, the simple dependence of the posterior on 
the data only via the likelihood regardless of the probability model is lost. If 
the prior is only a formal device and not to be interpreted as a probability, 
what interpretation is justified for the posterior as an adequate summary of 
information? In simple cases the reference prior may produce posterior distri- 
butions with especially good frequency properties but that is to show its use as 
a device for producing methods judged effective on other grounds, rather than 
as a conceptual tool. 

Examples 5.5 and 5.6 show that flat priors in many dimensions may yield 
totally unsatisfactory posterior distributions. There is in any case a sharp con- 
trast between such a use of relatively uninformative priors and the view that the 
objective of introducing the prior distribution is to inject additional informa- 
tion into the analysis. Many empirical applications of Bayesian techniques use 
proper prior distributions that appear flat relative to the information supplied 
by the data. If the parameter space to which the prior is attached is of relat- 
ively high dimension the results must to some extent be suspect. It is unclear at 
what point relatively flat priors that give good results from several viewpoints 
become unsatisfactory as the dimensionality of the parameter space increases. 


5.10 Temporal coherency


The word prior can be taken in two rather different senses, one referring to time 
ordering and the other, in the present context, to information preceding, or at 
least initially distinct from, the data under immediate analysis. Essentially it is 
the second meaning that is the more appropriate here and the distinction bears 
on the possibility of data-dependent priors and of consistency across time. 

Suppose that a hypothesis H is of interest and that it is proposed to obtain 
data to address H but that the data, to be denoted eventually by y, are not yet 
available. Then we may consider Bayes’ theorem in the form 


$$P^*(H \mid y) \propto P(H)\,P^*(y \mid H). \tag{5.14}$$


Here P* refers to a probability of a hypothesized situation; we do not yet know y. 
Later y becomes available and now 


$$P(H \mid y) \propto P^*(H)\,P(y \mid H). \tag{5.15}$$


Now the roles of P and P* have switched. The prior P*(H) is the prior 
that we would have if we did not know y, even though, in fact, we do know 
y. Temporal coherency is the assumption that, for example, P(H) = P*(H). 
In many situations this assumption is entirely reasonable. On the other hand 
it is not inevitable and there is nothing intrinsically inconsistent in changing 
prior assessments, in particular in the light of experience obtained either in the 
process of data collection or from the data themselves. 

For example, a prior assessment is made in the design stage of a study, perhaps 
on the basis of previous similar studies or on the basis of well-established theory, 
say that a parameter $\psi$ will lie in a certain range. Such assessments are a crucial 
part of study design, even if not formalized quantitatively in a prior distribution. 
The data are obtained and are in clear conflict with the prior assessment; for 
example $\psi$, expected to be negative, is pretty clearly positive. 

There are essentially three possible explanations. The play of chance may 
have been unkind. The data may be contaminated. The prior may be based 
on a misconception. Suppose that reexamination of the theoretical arguments 
leading to the initial prior shows that there was a mistake either of formulation 
or even of calculation and that correction of that mistake leads to conformity 
with the data. Then it is virtually obligatory to allow the prior to change, the 
change having been driven by the data. There are, of course, dangers in such 
retrospective adjustment of priors. In many fields even initially very surprising 
effects can post hoc be made to seem plausible. There is a broadly analogous 
difficulty in frequentist theory. This is the selection of effects for significance 
testing after seeing the data. 

Temporal coherency is a concern especially, although by no means only, in 
the context of studies continuing over a substantial period of time. 


5.11 Degree of belief and frequency 


In approaches to probability in which personalistic degree of belief is taken as 
the primary concept it is customary to relate that to frequency via a very long 
(infinite) sequence of exchangeable events. These can then be connected to a 
sequence of independent and identically distributed trials, each with the same 
probability p, and in which p itself has a probability distribution. 

Suppose instead that $E_1, \ldots, E_n$ are n events, all judged by You to have 
approximately the same probability p and not to be strongly dependent. In fact they 
need not, in general, be related substantively. It follows from the Weak Law of 
Large Numbers obeyed by personalistic probability that Your belief that about 
a proportion p of the events are true has probability close to 1. You believe 
strongly that Your probabilities have a frequency interpretation. 
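
The frequency statement can be illustrated by a small simulation. The sketch below assumes, purely for illustration, n independent events whose probabilities are scattered within 0.02 of a common value p = 0.3; the function name, the seed and the spread are hypothetical choices and not part of the argument above.

```python
import random


def proportion_true(n, p=0.3, spread=0.02, seed=1):
    """Simulate n independent events with probabilities near p.

    Each event i is assigned its own probability p_i, drawn uniformly
    from (p - spread, p + spread), to mimic events judged to have only
    approximately the same probability. Returns the proportion of
    events that occur.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        p_i = p + rng.uniform(-spread, spread)
        hits += rng.random() < p_i  # True counts as 1
    return hits / n


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, proportion_true(n))
```

As n grows the simulated proportion settles close to p, which is the behaviour the Weak Law of Large Numbers leads You to expect.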

This suggests that in order to elicit Your probability of an event E instead 
of contemplating hypothetical betting games You should try to find events 
$E_1, \ldots, E_n$, judged for good reason to have about the same probability as E, and 
then find what proportion of this set is true. That would then indicate appropriate 
betting behaviour. 

Of course, it is not suggested that this will be easily implemented. The argu- 
ment is more one of fundamental principle concerning the primacy, even in a 
personalistically focused approach, of relating to the real world as contrasted 
with total concentration on internal consistency. 


5.12 Statistical implementation of Bayesian analysis 


To implement a Bayesian analysis, a prior distribution must be specified explicitly, 
i.e., in effect numerically. Of course, as with other aspects of formulation, 
formal or informal sensitivity analysis will often be desirable. 

There are at least four ways in which a prior may be obtained. 

First there is the possibility of an implicitly empirical Bayes approach, 
where the word ‘empirical’ in this context implies some kind of frequency 
interpretation. Suppose that there is an unknown parameter referring to the 
outcome of a measuring process that is likely to be repeated in different cir- 
cumstances and quite possibly by different investigators. Thus it might be a 
mean length in metres, or a mean duration in years. These parameters are likely 
to vary widely in different circumstances, even within one broad context, and 
hence to have a widely dispersed distribution. The use of a very dispersed prior 
distribution might then lead to a posterior distribution with a frequency inter- 
pretation over this ensemble of repetitions. While this interpretation has some 
appeal for simple parameters such as means or simple measures of dispersion, 
it is less appropriate for measures of dependence such as regression coefficients 
and contrasts of means. Also the independence assumptions involved in apply- 
ing this to multiparameter problems are both implausible and potentially very 
misleading; see, for instance, Example 5.5. 

The second approach is a well-specified empirical Bayes one. Suppose that 
a parameter with the same interpretation, for example as a difference between 
two treatments, is investigated in a large number of independent studies under 
somewhat different conditions. Then even for the parameter in, say, the first 
study there is some information in the other studies, how much depending on a 
ratio of components of variance. 


Example 5.8. Components of variance (ctd). We start with a collapsed version 
of the model of Example 1.9 in which now there are random variables $Y_k$ 
assumed to have the form 


$$Y_k = \mu + \eta_k + \epsilon_k. \tag{5.16}$$


Here $Y_k$ is the estimate of say the treatment contrast of interest obtained 
from the kth study and $\epsilon_k$ is the internal estimate of error in that study. 
That is, indefinite replication of the kth study would have produced the 
value $\mu + \eta_k$. The variation between studies is represented by the random 
variables $\eta_k$. We suppose that the $\epsilon_k$ and $\eta_k$ are independently normally 
distributed with zero means and variances $\sigma_\epsilon^2$ and $\sigma_\eta^2$. Note that in this 
version $\epsilon_k$ will depend on the internal replication within the study. While this 
formulation is an idealized version of a situation that arises frequently in 
applications, there are many aspects of it that would need critical appraisal in an 
application. 

Now suppose that the number of studies is so large that $\mu$ and the components 
of variance may be regarded as known. If the mean of, say, the first group is 
the parameter of interest it has $Y_1$ as its direct estimate and also has the prior 
frequency distribution specified by a normal distribution of mean $\mu$ and variance 
$\sigma_\eta^2$, so that the posterior distribution has mean 


$$\frac{Y_1/\sigma_\epsilon^2 + \mu/\sigma_\eta^2}{1/\sigma_\epsilon^2 + 1/\sigma_\eta^2}. \tag{5.17}$$


Thus the direct estimate $Y_1$ is shrunk towards the general mean $\mu$; this general 
effect does depend on the assumption that the relevant error distributions are 
not very long-tailed. 
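
The shrinkage in (5.17) can be made concrete with a few lines of code. The following is a minimal sketch, assuming the general mean and both variance components are known; the function and argument names are hypothetical. The posterior variance it also returns is the standard normal-theory result (the reciprocal of the summed precisions) and is not stated in the text above.

```python
def shrinkage_posterior(y1, mu, var_eps, var_eta):
    """Posterior mean and variance for the first study's true value.

    y1      : direct estimate from the first study
    mu      : general mean, treated as known
    var_eps : internal error variance of the direct estimate
    var_eta : between-study component of variance
    """
    w_data = 1.0 / var_eps   # precision of the direct estimate
    w_prior = 1.0 / var_eta  # precision of the between-study distribution
    post_mean = (y1 * w_data + mu * w_prior) / (w_data + w_prior)
    post_var = 1.0 / (w_data + w_prior)
    return post_mean, post_var


if __name__ == "__main__":
    # Equal variances: the estimate is pulled halfway towards mu.
    print(shrinkage_posterior(y1=2.0, mu=0.0, var_eps=1.0, var_eta=1.0))
    # Larger between-study variance: less shrinkage.
    print(shrinkage_posterior(y1=2.0, mu=0.0, var_eps=1.0, var_eta=3.0))
```

With equal variances the direct estimate 2.0 is shrunk to 1.0; when the between-study variance is three times the internal variance the posterior mean is 1.5, so the amount of shrinkage is governed by the ratio of the components of variance.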

If now errors of estimation of $\mu$ and the variance components are to be 
included in the analysis there are two possibilities. The first possibility is 
to treat the estimation of $\mu$ and the variance components from a frequentist 
perspective. The other analysis, which may be called fully Bayesian, is 
to determine a prior density for $\mu$ and the variance components and to 
determine the posterior density of $\mu + \eta_1$ from first principles. In such 
a context the parameters $\mu$ and the variance components may be called 
hyperparameters and their prior distribution a hyperprior. Even so, within 
formal Bayesian theory they are simply unknowns to be treated like other 
unknowns. 


Now in many contexts empirical replication of similar parameters is not avail- 
able. The third approach follows the personalistic prescription. You elicit Your 
prior probabilities, for example via the kind of hypothetical betting games out- 
lined in Section 5.7. For parameters taking continuous values a more realistic 
notion is to elicit the values of some parameters encapsulating Your prior distri- 
bution in some reasonably simple but realistic parametric form. Note that this 
distribution will typically be multidimensional and any specification of prior 
independence of component parameters may be critical. 

In many, if not almost all, applications of statistical analysis the objective is 
to clarify what can reasonably be learned from data, at least in part as a base 
for public discussion within an organization or within a scientific field. For the 
approach to be applicable in such situations there needs both to be a self-denying 
ordinance that any prior probabilities are in some sense evidence-based, that 
base to be reported, and also for some approximate consensus on their values 
to be achieved. 

What form might such evidence take? One possibility is that there are pre- 
vious data which can reasonably be represented by a model in which some, 
at least, of the parameters, and preferably some of the parameters of interest, 
are in common with the current data. In this case, however, provided a consist- 
ent formulation is adopted, the same results would be achieved by a combined 
analysis of all data or, for that matter, by taking the current data as prior to 
the historical data. This is a consequence of the self-consistency of the laws of 
probability. Note that estimation of hyperparameters will be needed either by 
frequentist methods or via a specification of a prior for them. In this context the 
issue is essentially one of combining statistical evidence from different sources. 
One of the principles of such combination is that the mutual consistency of the 
information should be checked before merging. We return to this issue in the 
next section. 

A second possibility is that the prior distribution is based on relatively inform- 
ally recalled experience of a field, for example on data that have been seen 
only informally, or which are partly anecdotal, and which have not been care- 
fully and explicitly analysed. An alternative is that it is based on provisional 
not-well-tested theory, which has not been introduced into the formulation of 
the probability model for the data. In the former case addition of an allow- 
ance for errors, biases and random errors in assessing the evidence would be 
sensible. 

There is a broad distinction between two uses of such prior distributions. 
One is to deal with aspects intrinsic to the data under analysis and the other 
is to introduce external information. For example, in extreme cases the dir- 
ect interpretation of the data may depend on aspects about which there is no 
information in the data. This covers such aspects as some kinds of systematic 
error, measurement errors in explanatory variables, the effect of unobserved 
explanatory variables, and so on. The choices are between considering such 
aspects solely qualitatively, or by formal sensitivity analysis, or via a prior 
distribution over parameters describing the unknown features. In the second 
situation there may be relevant partially formed theory or expert opinion 
external to the data. Dealing with expert opinion may raise delicate issues. 
Some research is designed to replace somewhat vaguely based expert opin- 
ion by explicit data-based conclusions and in that context it may be unwise to 
include expert opinion in the analysis, unless it is done in a way in which 
its impact can be disentangled from the contribution of the data. In some 
areas at least such expert opinion, even if vigorously expressed, may be 
fragile. 

The next example discusses some aspects of dealing with systematic error. 


Example 5.9. Bias assessment. An idealized version of the assessment of bias, 
or systematic error, is that $\bar{Y}$ is normally distributed with mean $\mu$ and known 
variance $\sigma_0^2/n$ and that the parameter of interest $\psi$ is $\mu + \gamma$, where $\gamma$ is an 
unknown bias or systematic error that cannot be directly estimated from the 
data under analysis, so that assessment of its effect has to come from external 
sources. 

The most satisfactory approach in many ways is a sensitivity analysis in 
which $\gamma$ is assigned a grid of values in the plausible range and confidence 
intervals for $\psi$ plotted against $\gamma$. Of course, in this case the intervals are a 
simple translation of those for $\mu$ but in more complex cases the relation might 
not be so straightforward and if there are several sources of bias a response 
surface would have to be described showing the dependence of the conclusions 
on components in question. It would then be necessary to establish the region 
in which the conclusions from $\bar{Y}$ would be seriously changed by the biases. 

An alternative approach, formally possibly more attractive but probably in the 
end less insightful, is to elicit a prior distribution for $\gamma$, for example a normal 
distribution of mean $\nu$ and variance $\tau^2$, based for instance on experience of 
similar studies for which the true outcome and hence the possible bias could 
be estimated. Then the posterior distribution is normal with mean a weighted 
average of $\bar{Y}$ and $\nu$. 
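
The sensitivity analysis described in this example can be sketched as follows. This is a minimal illustration, assuming a known standard deviation and a normal-theory interval with multiplier 1.96 for an approximate 95 per cent level; the function names, the grid of bias values and the numerical inputs are all hypothetical.

```python
import math


def interval_for_target(ybar, sigma0, n, bias, z=1.96):
    """Confidence interval for (mean + bias) at one assumed bias value.

    The interval for the mean is ybar +/- z * sigma0 / sqrt(n); for a
    fixed assumed bias the interval for the target is that interval
    translated by the bias.
    """
    half_width = z * sigma0 / math.sqrt(n)
    centre = ybar + bias
    return centre - half_width, centre + half_width


def sensitivity_table(ybar, sigma0, n, bias_grid, z=1.96):
    """Rows of (assumed bias, lower limit, upper limit) over a grid."""
    return [(b, *interval_for_target(ybar, sigma0, n, b, z)) for b in bias_grid]


if __name__ == "__main__":
    grid = [-0.5, -0.25, 0.0, 0.25, 0.5]  # hypothetical plausible range
    for bias, lower, upper in sensitivity_table(1.0, 2.0, 25, grid):
        print(f"bias={bias:+.2f}  interval=({lower:.3f}, {upper:.3f})")
```

Reading down such a table shows directly for which assumed biases a conclusion, for instance that the target is positive, would be overturned.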


The emphasis on the second and third approaches to the choice of prior is 
on the refinement of analysis by inserting useful information, in the latter case 
information that is not directly assessable as a statistical frequency. 

The final approach returns to the notion of a prior that will be generally 
accepted as a basis for assessing evidence from the data, i.e., which explicitly 
aims to insert as little new information as possible. For relatively simple prob- 
lems we have already seen examples where limiting forms of prior reproduce 
approximately or exactly posterior intervals equivalent to confidence intervals. 
In complex problems with many parameters direct generalization of the simple 
priors produces analyses with manifestly unsatisfactory results; the notion of a 
flat or indifferent prior in a many-dimensional problem is untenable. The notion 
of a reference prior has been developed to achieve strong dependence on the 
data but can be complicated in implementation and the interpretation of the 
resulting posterior intervals, other than as approximate confidence intervals, is 
hard to fathom. 

There is a conceptual conflict underlying this discussion. Conclusions 
expressed in terms of probability are on the face of it more powerful than 
those expressed indirectly via confidence intervals and p-values. Further, in 
principle at least, they allow the inclusion of a richer pool of information. But 
this latter information is typically more fragile or even nebulous as compared 
with that typically derived more directly from the data under analysis. The 
implication seems to be that conclusions derived from the frequentist approach 
are more immediately secure than those derived from most Bayesian analyses, 
except from those of a directly empirical Bayes kind. Any counter-advantages 
of Bayesian analyses come not from a tighter and more logically compelling 
theoretical underpinning but rather from the ability to quantify additional kinds 
of information. 

Throughout the discussion so far it has been assumed that the model for the 
data, and which thus determines the likelihood, is securely based. The following 
section deals briefly with model uncertainty. 

5.13 Model uncertainty 


The very word model implies idealization of the real system and, except just 
possibly in the more esoteric parts of modern physics, as already noted in 
Section 3.1, it hardly makes sense to talk of a model being true. 

One relatively simple approach is to test, formally or informally, for inconsist- 
encies between the model and the data. Inconsistencies if found and statistically 
significant may nevertheless occasionally be judged unimportant for the inter- 
pretation of the data. More commonly inconsistencies if detected will require 
minor or major modification of the model. If no inconsistencies are uncovered 
it may be sufficient to regard the analysis, whether Bayesian or frequentist, as 
totally within the framework of the model. 

Preferably, provisional conclusions having been drawn from an apparently 
well-fitting model, the question should be considered: have assumptions been 
made that might invalidate the conclusions? The answer will depend on how 
clear-cut the conclusions are and in many cases informal consideration of the 
question will suffice. In many cases analyses based on different models may be 
required; the new models may change the specification of secondary features of 
the model, for example by making different assumptions about the form of error 
distributions or dependence structures among errors. Often more seriously, the 
formulation of the primary questions may be changed, affecting the definition 
and interpretation of the parameter of interest $\psi$. 

In frequentist discussions this process is informal; it may, for example, 
involve showing graphically or in tabular form the consequences of those dif- 
ferent assumptions, combined with arguments about the general plausibility of 
different assumptions. In Bayesian discussions there is the possibility of the pro- 
cess of Bayesian model averaging. Ideally essentially the same subject-matter 
conclusions follow from the different models. Provided prior probabilities are 
specified for the different models and for the defining parameters within each 
model, then in principle a posterior probability for each model can be found 
and if appropriate a posterior distribution for a target variable found averaged 
across models, giving each its appropriate weight. The prior distributions for 
the parameters in each model must be proper. 
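
The averaging step can be sketched numerically. The fragment below is a minimal illustration, assuming that the log marginal likelihood of the data under each model (the likelihood integrated over that model's proper prior) has already been computed elsewhere; the function names and the two-model numbers are hypothetical.

```python
import math


def posterior_model_probs(prior_probs, log_marginal_likelihoods):
    """Posterior probability of each model.

    Each posterior probability is proportional to the prior probability
    of the model times the marginal likelihood of the data under it.
    The largest log marginal likelihood is subtracted before
    exponentiating, purely for numerical stability.
    """
    top = max(log_marginal_likelihoods)
    weights = [
        p * math.exp(lml - top)
        for p, lml in zip(prior_probs, log_marginal_likelihoods)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def model_averaged_mean(post_probs, model_means):
    """Posterior mean of a target variable, averaged across models."""
    return sum(p * m for p, m in zip(post_probs, model_means))


if __name__ == "__main__":
    probs = posterior_model_probs([0.5, 0.5], [math.log(3.0), math.log(1.0)])
    print(probs)
    print(model_averaged_mean(probs, [1.0, 5.0]))
```

The sketch only makes sense when the target variable means the same thing under every model, for example a prediction of a new outcome.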

Unfortunately for many interpretative purposes this procedure is inappropriate, 
quite apart from difficulties in specifying the priors. An exception is 
when empirical prediction of a new outcome variable, commonly defined for 
all models, is required; the notion of averaging several different predictions 
has some general appeal. The difficulty with interpretation is that seemingly 
similar parameters in different models rarely represent quite the same thing. 
Thus a regression coefficient of a response variable on a particular explanatory 
variable changes its interpretation if the outcome variable is transformed and 
more critically may depend drastically on which other explanatory variables are 
used. Further, if different reasonably well-fitting models give rather different 
conclusions it is important to know and report this. 


5.14 Consistency of data and prior 


Especially when a prior distribution represents potentially new important 
information, it is in principle desirable to examine the mutual consistency of 
information provided by the data and the prior. This will not be possible for 
aspects of the model for which there is little or no information in the data, but 
in general some comparison will be possible. 

A serious discrepancy may mean that the prior is wrong, i.e., does not cor- 
respond to subject-matter reality, that the data are seriously biased or that the 
play of chance has been extreme. 

Often comparison of the two sources of information can be informal but for a 
more formal approach it is necessary to find a distribution of observable quant- 
ities that is exactly or largely free of unknown parameters. Such a distribution 
is the marginal distribution of a statistic implied by likelihood and prior. 

For example in the discussion of Example 1.1 about the normal mean, the 
marginal distribution implied by the likelihood and the normal prior is that 


$$\frac{\bar{Y} - m}{\sqrt{\sigma_0^2/n + v}} \tag{5.18}$$


has a standard normal distribution. That is, discrepancy is indicated whenever 
$\bar{Y}$ and the prior mean m are sufficiently far apart. Note that a very flat prior, i.e., 
one with extremely large v, will not be found discrepant. 

A similar argument applies to other exponential family situations. Thus a 
binomial model with a beta conjugate prior implies a beta-binomial distribution 
for the number of successes, this distribution having known parameters and 
thus in principle allowing exact evaluation of the statistical significance of any 
departure. 
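
Both checks in this section are easy to compute. The sketch below is illustrative only: the function names are hypothetical, the two-sided tail probability is one possible way of summarizing the discrepancy measured by (5.18), and the beta-binomial probability is written out from its standard formula rather than taken from the text.

```python
import math


def normal_discrepancy(ybar, m, sigma0_sq, n, v):
    """Standardized discrepancy between sample mean and prior mean.

    Returns the statistic (ybar - m) / sqrt(sigma0_sq / n + v) together
    with its two-sided tail probability under the standard normal.
    """
    z = (ybar - m) / math.sqrt(sigma0_sq / n + v)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(r, n, a, b):
    """P(R = r) for r successes in n trials with a beta(a, b) prior."""
    return math.comb(n, r) * math.exp(log_beta(r + a, n - r + b) - log_beta(a, b))


if __name__ == "__main__":
    print(normal_discrepancy(ybar=1.0, m=0.0, sigma0_sq=1.0, n=4, v=0.75))
    # A very flat prior (huge v) is never found discrepant.
    print(normal_discrepancy(ybar=1.0, m=0.0, sigma0_sq=1.0, n=4, v=1e6))
    print(beta_binomial_pmf(3, 10, 1.0, 1.0))
```

Summing the beta-binomial probabilities over outcomes at least as extreme as the one observed would give the exact significance level mentioned above.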


5.15 Relevance of frequentist assessment 


A key issue concerns the circumstances under which the frequentist interpret- 
ation of a p-value or confidence interval is relevant for a particular situation 
under study. In some rather general sense, following procedures that are in 
error relatively infrequently in the long run is some assurance for the particular 
case but one would like to go beyond that and be more specific. 

Appropriate conditioning is one aspect already discussed. Another is the 
following. As with other measuring devices a p-value is calibrated in terms of 
the consequences of using it; also there is an implicit protocol for application 
that hinges on ensuring the relevance of the calibration procedure. 

This protocol is essentially as follows. There is a question. A model is formulated. 
To help answer the question it may be that the hypothesis $\psi = \psi_0$ 
is considered. A test statistic is chosen. Data become available. The test stat- 
istic is calculated. In fact it will be relatively rare that this protocol is followed 
precisely in the form just set out. 

It would be unusual and indeed unwise to start such an analysis without some 
preliminary checks of data completeness and quality. Corrections to the data 
would typically not affect the relevance of the protocol, but the preliminary 
study might suggest some modification of the proposed analysis. For example: 


• some subsidiary aspects of the model might need amendment, for example 
it might be desirable to allow systematic changes in variance in a regression 
model; 

• it might be desirable to change the precise formulation of the research 
question, for example by changing a specification of how E(Y) depends on 
explanatory variables to one in which E(log Y) is considered instead; 

• a large number of tests of distinct hypotheses might be done, with all those 
showing insignificant departures discarded and only those showing 
significant departure from the relevant null hypotheses reported; 

• occasionally the whole focus of the investigation might change to the study 
of some unexpected aspect which nullified the original intention. 


The third of these represents poor reporting practice but does correspond 
roughly to what sometimes happens in less blatant form. 

It is difficult to specify criteria under which the departure from the protocol 
is so severe that the corresponding procedure is useless or misleading. Of the 
above instances, in the first two a standard analysis of the new model is likely 
to be reasonably satisfactory. In a qualitative way they correspond to fitting a 
broader model than the one originally contemplated and provided the fitting 
criterion is not, for example, chosen to maximize statistical significance, the 
results will be reasonably appropriate. That is not the case, however, for the last 
two possibilities. 


Example 5.10. Selective reporting. Suppose that m independent sets of data are 
available each with its appropriate null hypothesis. Each is tested and p’ is the 
smallest p-value achieved and H’ the corresponding null hypothesis. Suppose 
that only H’ and p’ are reported. If say m = 100 it would be no surprise to find 
a p’ as small as 0.01. 

In this particular case the procedure followed is sufficiently clearly specified 
that a new and totally relevant protocol can be formulated. The test is based on 
the smallest of m independently distributed random variables, all with a uniform 
distribution under the overall null hypothesis of no effects. If the corresponding 
random variable is P’, then 


$$P(P' > x) = (1 - x)^m, \tag{5.19}$$


because in order that P’ > x it is necessary and sufficient that all individual ps 
exceed x. Thus the significance level to be attached to p’ under this scheme of 
investigation is 


$$1 - (1 - p')^m \tag{5.20}$$


and if p’ is small and m not too large this will be close to mp’. 


This procedure, named after Bonferroni, gives a quite widely useful way of 
adjusting simple p-values. 
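
The adjustment (5.20) and its approximation can be compared numerically. This is a minimal sketch with hypothetical function names; capping the approximation mp’ at 1 is a convenience added here so that the value remains a probability.

```python
def selection_adjusted_p(p_min, m):
    """Significance level for the smallest of m independent p-values."""
    return 1.0 - (1.0 - p_min) ** m


def bonferroni_bound(p_min, m):
    """The simple approximation m * p_min, capped at 1."""
    return min(1.0, m * p_min)


if __name__ == "__main__":
    for p_min, m in [(0.01, 100), (0.001, 10), (0.0005, 20)]:
        exact = selection_adjusted_p(p_min, m)
        approx = bonferroni_bound(p_min, m)
        print(f"p'={p_min}, m={m}: exact={exact:.4f}, m*p'={approx:.4f}")
```

With m = 100 and p’ = 0.01 the adjusted level is about 0.63, in line with the remark that such a value of p’ would be no surprise, whereas for small mp’ the two expressions nearly coincide.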

There is an extensive set of procedures, of which this is the simplest and most 
important, known under the name of multiple comparison methods. The name 
is, however, somewhat of a misnomer. Many investigations set out to answer 
several questions via one set of data and difficulties arise not so much from 
dealing with several questions, each answer with its measure of uncertainty, but 
rather from selecting one or a small number of questions on the basis of the 
apparent answer. 

The corresponding Bayesian analysis requires a much more detailed spe- 
cification and this point indeed illustrates one difference between the broad 
approaches. It would be naive to think that all problems deserve very detailed 
specification, even in those cases where it is possible in principle to set out such 
a formulation with any realism. Here for a Bayesian treatment it is necessary 
to specify both the prior probability that any particular set of null hypotheses 
is false and the prior distributions holding under the relevant alternative. Form- 
ally there may be no particular problem in doing this but for the prior to reflect 
genuine knowledge considerable detail would be involved. Given this formula- 
tion, the posterior distribution over the set of m hypotheses and corresponding 
alternatives is in principle determined. In particular the posterior probability of 
any particular hypothesis can be found. 

Suppose now that only the set of data with largest apparent effect is con- 
sidered. It would seem that if the prior distributions involve strong assumptions 
of independence among the sets of data, and of course that is not essential, 
then the information that the chosen set has the largest effect is irrelevant, the 
posterior distribution is unchanged, i.e., no direct allowance for selection is 
required. 

A resolution of the apparent conflict with the frequentist discussion is, how- 
ever, obtained if it is reasonable to argue that such a strategy of analysis is 
most likely to be used, if at all, when most of the individual null hypotheses 
are essentially correct. That is, with m hypotheses under examination the prior 
probability of any one being false may be approximately $\nu_0/m$, where $\nu_0$ may 
be treated as constant as m varies. Indeed $\nu_0$ might be approximately 1, so that 
the prior expectation is that one of the null hypotheses is false. The dependence 
on m is thereby restored. 

An important issue here is that to the extent that the statistical analysis is 
concerned with the relation between data and a hypothesis about that data, 
it might seem that the relation should be unaffected by how the hypothesis 
came to be considered. Indeed a different investigator who had focused on 
the particular hypothesis H’ from the start would be entitled to use p’. But if 
simple significance tests are to be used as an aid to interpretation and discov- 
ery in somewhat exploratory situations, it is clear that some such precaution 
as the use of (5.20) is essential to ensure relevance to the analysis as imple- 
mented and to avoid the occurrence of systematically wrong answers. In fact, 
more broadly, ingenious investigators often have little difficulty in producing 
convincing after-the-event explanations of surprising conclusions that were 
unanticipated beforehand but which retrospectively may even have high prior 
probability; see Section 5.10. Such ingenuity is certainly important but explan- 
ations produced by that route have, in the short term at least, different status 
from those put forward beforehand. 


5.16 Sequential stopping 


We have in much of the discussion set out a model in which the number n of 
primary observations is a constant not a random variable. While there are many 
contexts where this is appropriate, there are others where in some sense n could 
reasonably be represented as a random variable. Thus in many observational 
studies, even if there is a target number of observations, the number achieved 
may depart from this for a variety of reasons. 

If we can reasonably add to the log likelihood a term for the probability 
density of the associated random variable N that is either known or, more 
realistically, depends on an unknown parameter $\gamma$, we may justify conditioning 
on the observed sample size in the Bayesian formulation if the prior densities 
of $\gamma$ and of $\theta$ are independent, or in the frequentist version if the parameters 
are variation-independent. This covers many applications and corresponds to 
current practice. 

The situation is more difficult if the random variables are observed in 
sequence in such a way that, at least in a very formalized setting, when the 
first m observations, denoted collectively by $y^{(m)}$, are available, the probability 
that the (m + 1)th observation is obtained is $p_{m+1}(y^{(m)})$ and that otherwise no 
more observations are obtained and N = m. In a formalized stopping rule the 
ps are typically 0 or 1, in that a decision about stopping is based solely on the 
current data. That is, it is assumed that in any discussion about whether to stop 
no additional information bearing on the parameter is used. The likelihood is 
essentially unchanged by the inclusion of such a purely data-dependent factor, 
so that, in particular, any Fisherian reduction of the data to sufficient statistics 
is unaffected after the inclusion of the realized sample size in the statistic; the 
values of intermediate data are then irrelevant. 

In a Bayesian analysis, provided temporal coherency holds, i.e., that the 
prior does not change during the investigation, the stopping rule is irrelevant 
for analysis and the posterior density is computed from the likelihood achieved 
as if n were fixed. In a frequentist approach, however, this is not usually the 
case. In the simplest situation with a scalar parameter $\theta$ and a fixed-sample-size 
one-dimensional sufficient statistic s, the sufficient reduction is now only to the 
(2, 1) family (s, n) and it is not in general clear how to proceed. 

In some circumstances it can be shown, or more often plausibly assumed as 
an approximation, that N is an ancillary statistic. That is to say, knowledge of n 
on its own would give little or no information about the parameter of interest. 
Then it will be reasonable to condition on the value of sample size and to 
consider the conditional distribution of S for fixed sample size, i.e., to ignore 
the particular procedure used to determine the sample size. 


Example 5.11. Precision-based choice of sample size. Suppose that the mean 
of a normal distribution, or more generally a contrast of several means, is being 
estimated and that at each stage of the analysis the standard error of the mean 
or contrast is found by the usual procedure. Suppose that the decision to stop 
collecting data is based on these indicators of precision, not of the estimates of 
primary concern, the mean or contrast of means. Then treating sample size as 
fixed defines a reasonable calibration of the p-value and confidence limits. 


The most extreme departure from treating sample size as fixed arises when 
the statistic s, typically a sum or mean value, is fixed. Then it is the inverse 
distribution, i.e., of N given S = s, that is relevant. It can be shown that in at least 
some cases the difference from the analysis treating sample size as fixed is 
small. 


Example 5.12. Sampling the Poisson process. In any inference about the 
Poisson process of rate $\rho$ the sufficient statistic is (N, S), the number of points 
observed and the total time of observation. If S is fixed, N has a Poisson 
distribution of mean $\rho s$, whereas if N is fixed S has the gamma density of mean 
$n/\rho$, namely 

$$ \rho(\rho s)^{n-1} e^{-\rho s}/(n-1)!. \tag{5.21} $$

To test the null hypothesis $\rho = \rho_0$ looking for one-sided departures in the 
direction $\rho > \rho_0$ there are two p-values corresponding to the two modes of 
sampling, namely 

$$ \sum_{r=n}^{\infty} e^{-\rho_0 s}(\rho_0 s)^r/r! \tag{5.22} $$

and 

$$ \int_0^s \{e^{-\rho_0 x}(\rho_0 x)^{n-1}\rho_0/(n-1)!\}\,dx. \tag{5.23} $$

Repeated integration by parts shows that the integral is equal to (5.22) as is 
clear from a direct probabilistic argument. 
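
The equality of the two p-values can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `poisson_upper_tail` and `gamma_lower_tail` are illustrative, and the integral is evaluated by Simpson's rule rather than by any special-function routine.

```python
import math

def poisson_upper_tail(n, mean):
    """P(N >= n) for N Poisson with the given mean, as one minus a finite sum."""
    term = math.exp(-mean)          # P(N = 0)
    cdf = 0.0
    for r in range(n):
        cdf += term                 # add P(N = r)
        term *= mean / (r + 1)      # move on to P(N = r + 1)
    return 1.0 - cdf

def gamma_lower_tail(n, rate, s, steps=20000):
    """P(S <= s) for the gamma density rate*(rate*x)^(n-1)*exp(-rate*x)/(n-1)!.

    Uses Simpson's rule on [0, s]; `steps` must be even.
    """
    def density(x):
        return rate * (rate * x) ** (n - 1) * math.exp(-rate * x) / math.factorial(n - 1)

    h = s / steps
    total = density(0.0) + density(s)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0

# Fixed observation time s = 3 with n = 9 points, null rate 2 (illustrative values).
p_fixed_time = poisson_upper_tail(9, 2.0 * 3.0)
p_fixed_count = gamma_lower_tail(9, 2.0, 3.0)
print(p_fixed_time, p_fixed_count)
```

The direct probabilistic argument is that at least n points occur in time s exactly when the waiting time to the nth point is at most s, so the two printed values should agree up to the quadrature error.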

Very similar results hold for the binomial and normal distributions, the latter 
involving the inverse Gaussian distribution. 


The primary situations where there is an appreciable effect of the stopping 
rule on the analysis are those where the emphasis is strongly on the testing of a 
particular null hypothesis, with attention focused strongly or even exclusively 
on the resulting p-value. This is seen most strongly by the procedure that 
stops when and only when a preassigned level of significance is reached in 
formal testing of a given null hypothesis. There is an extensive literature on the 
sequential analysis of such situations. 
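
The effect of stopping when and only when significance is reached can be illustrated by simulation. This is a hedged sketch under assumptions not in the text: normal observations with known unit variance, a two-sided test at the nominal 5% level applied after every observation, and a hypothetical cap `max_n` on the sample size.

```python
import math
import random

def stops_with_significance(max_n, z_crit, rng):
    """Generate N(0, 1) data under the null; True if |z| ever exceeds z_crit."""
    total = 0.0
    for m in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / math.sqrt(m)      # standardized mean of the first m values
        if abs(z) > z_crit:
            return True               # sampling stops, "significance" is declared
    return False

def rejection_rate(max_n, z_crit=1.96, reps=2000, seed=1):
    """Proportion of null simulations in which the stopping rule declares significance."""
    rng = random.Random(seed)
    hits = sum(stops_with_significance(max_n, z_crit, rng) for _ in range(reps))
    return hits / reps

print(rejection_rate(1))     # a single look: close to the nominal 0.05
print(rejection_rate(100))   # up to 100 looks: well above 0.05
```

With one look the rule is the ordinary fixed-sample test. With repeated looks the probability of ever reaching the nominal level under the null hypothesis grows with `max_n`, which is why a frequentist interpretation of the final p-value has to take the stopping criterion into account.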

It is here that one of the strongest contrasts arises between Bayesian and 
frequentist formulations. In the Bayesian approach, provided that the prior and 
model remain fixed throughout the investigation, the final inference, in partic- 
ular the posterior probability of the null hypothesis, can depend on the data 
only through the likelihood function at the terminal stage. In the frequentist 
approach, the final interpretation involves the data directly not only via that 
same likelihood function but also on the stopping criterion used, and so in par- 
ticular on the information that stopping did not take place earlier. It should 
involve the specific earlier data only if issues of model adequacy are involved; 
for example it might be suspected that the effect of an explanatory variable had 
been changing in time. 
5.17 A simple classification problem 


A very idealized problem that illuminates some aspects of the different 
approaches concerns a classification problem involving the allocation of a new 
individual to one of two groups, $G_0$ and $G_1$. Suppose that this is to be done on 
the basis of an observational vector y, which for group $G_k$ has known density 
$f_k(y)$ for k = 0, 1. 

For a Bayesian analysis suppose that the prior probabilities are $\pi_k$, with 
$\pi_0 + \pi_1 = 1$. Then the posterior probabilities satisfy 

$$ \frac{P(G_1 \mid Y = y)}{P(G_0 \mid Y = y)} = \frac{\pi_1 f_1(y)}{\pi_0 f_0(y)}. \tag{5.24} $$

One possibility is to classify an individual into the group with the higher 
posterior probability, another is to record the posterior probability, or more simply 
the posterior odds, as a succinct measure of evidence corresponding to each 
particular y. 

The issue of a unique assignment becomes a well-formulated decision 
problem if relative values can be assigned to $w_{01}$ and $w_{10}$, the losses of utility 
following from classifying an individual from group $G_0$ into group $G_1$ and vice 
versa. A rule minimizing expected long-run loss of utility is to classify to group 
$G_1$ if and only if 

$$ \frac{\pi_1 f_1(y)}{\pi_0 f_0(y)} \ge \frac{w_{01}}{w_{10}}. \tag{5.25} $$

The decision can be made arbitrarily in the case of exact equality. 

The data enter only via the likelihood ratio $f_1(y)/f_0(y)$. One non-Bayesian 
resolution of the problem would be to use just the likelihood ratio itself as a 
summary measure of evidence. It has some direct intuitive appeal and would 
allow any individual with a prior probability to make an immediate assessment. 
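
The posterior odds (5.24) and the allocation rule (5.25) can be sketched in a few lines. The example below is illustrative only: the two univariate normal densities, the prior probability `pi1` and the losses `w01`, `w10` are assumed values chosen for the demonstration, and the function names are not from the text.

```python
import math

def normal_density(mean, sd):
    """Return the N(mean, sd^2) density as a function of y."""
    def f(y):
        z = (y - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return f

def posterior_odds(y, f0, f1, pi1):
    """Posterior odds of G1 against G0: pi1 f1(y) / (pi0 f0(y)), as in (5.24)."""
    pi0 = 1.0 - pi1
    return pi1 * f1(y) / (pi0 * f0(y))

def classify(y, f0, f1, pi1, w01, w10):
    """Return 1 if rule (5.25) allocates y to G1, otherwise 0."""
    return 1 if posterior_odds(y, f0, f1, pi1) >= w01 / w10 else 0

f0 = normal_density(0.0, 1.0)   # assumed density for group G0
f1 = normal_density(2.0, 1.0)   # assumed density for group G1

print(classify(1.5, f0, f1, pi1=0.5, w01=1.0, w10=1.0))    # equal losses
print(classify(1.5, f0, f1, pi1=0.5, w01=10.0, w10=1.0))   # misallocating G0 is costly
```

Raising `w01`, the loss from putting a $G_0$ individual into $G_1$, raises the posterior odds needed before allocation to $G_1$, so the same observation can be allocated differently under different losses.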

We may, however, approach the problem also more in the spirit of a direct 
frequency interpretation involving hypothetical error rates. For this, derivation 
of a sufficiency reduction is the first step. Write the densities in the form 


$$ f_0(y)\exp[\psi\log\{f_1(y)/f_0(y)\} - k(\psi)], \tag{5.26} $$


where the parameter space for the parameter of interest $\psi$ consists of the two 
points 0 and 1. The sufficient statistic is thus the likelihood ratio. 

Now consider testing the null hypothesis that the observation y comes from 
the distribution $f_0(y)$, the test to be sensitive against the alternative distribution 
$f_1(y)$. This is a radical reformulation of the problem in that it treats the two 
distributions asymmetrically putting emphasis in particular on $f_0(y)$ as a null 
hypothesis. The initial formulation as a classification problem treats both 
distributions essentially symmetrically. In the significance testing formulation we 
use the likelihood ratio as the test statistic, in virtue of the sufficiency reduction, 
and calculate as p-value 

$$ \int_{L_{\mathrm{obs}}} f_0(y)\,dy, \tag{5.27} $$

where $L_{\mathrm{obs}}$ is the set of points with a likelihood ratio as great or greater than that 
observed. In the Neyman-Pearson theory of testing hypotheses this is indirectly 
the so-called Fundamental Lemma. 

A formally different approach to this issue forms the starting point to the 
Neyman-Pearson theory of testing hypotheses formalized directly in the Fun- 
damental Lemma. This states, in the terminology used here, that of all test 
statistics that might be used, the probability of exceeding a preassigned p-value 
is maximized under the alternative $f_1(y)$ by basing the test on the likelihood 
ratio. Proofs are given in most books on statistical theory and will be omitted. 

Symmetry between the treatment of the two distributions is restored by in 
effect using the relation between tests and confidence intervals. That is, we 
test for consistency with $f_0(y)$ against alternative $f_1(y)$ and then test again 
interchanging the roles of the two distributions. The conclusion is then in the 
general form that the data are consistent with both specified distributions, with 
one and not the other, or with neither. 

Of course, this is using the data for interpretative purposes not for the clas- 
sification aspect specified in the initial part of this discussion. For the latter 
purpose summarization via the likelihood ratio is simple and direct. 


Example 5.13. Multivariate normal distributions. A simple example illustrat- 
ing the above discussion is to suppose that the two distributions are multivariate 
normal distributions with means $\mu_0$, $\mu_1$ and known covariance matrix $\Sigma_0$. We 
may, without loss of generality, take $\mu_0 = 0$. The log likelihood ratio is then 

$$ -(y - \mu_1)^T \Sigma_0^{-1}(y - \mu_1)/2 + (y^T \Sigma_0^{-1} y)/2 = y^T \Sigma_0^{-1}\mu_1 - \mu_1^T \Sigma_0^{-1}\mu_1/2. \tag{5.28} $$

The observations thus enter via the linear discriminant function $y^T \Sigma_0^{-1}\mu_1$. The 
various more detailed allocation rules and procedures developed above follow 
immediately. 
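
As a small numerical check of (5.28), the sketch below evaluates the log likelihood ratio for bivariate data both through the linear discriminant and through the two quadratic forms. It is restricted to two dimensions so that no matrix library is needed; the helper names and the numerical values of the covariance matrix, mean and observation are assumptions made for illustration.

```python
def solve_2x2(a, b):
    """Solve a x = b for a 2x2 matrix a (nested lists) and a length-2 vector b."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def log_likelihood_ratio(y, mu1, sigma0):
    """Right-hand side of (5.28): linear discriminant minus a constant."""
    w = solve_2x2(sigma0, mu1)            # w = sigma0^{-1} mu1
    return dot(y, w) - dot(mu1, w) / 2.0

def quadratic_form(v, sigma0):
    """v^T sigma0^{-1} v."""
    return dot(v, solve_2x2(sigma0, v))

sigma0 = [[2.0, 0.5], [0.5, 1.0]]   # assumed common covariance matrix
mu1 = [1.0, 2.0]                    # assumed mean of the second group (mu0 = 0)
y = [0.3, -0.7]                     # assumed observation

diff = [y[0] - mu1[0], y[1] - mu1[1]]
left_side = -quadratic_form(diff, sigma0) / 2.0 + quadratic_form(y, sigma0) / 2.0
print(left_side, log_likelihood_ratio(y, mu1, sigma0))
```

The two printed values agree because the quadratic terms in y cancel when the covariance matrices are equal, leaving a function that is linear in y.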

The geometric interpretation of the result is best seen by a preliminary linear 
transformation of y so that it has the identity matrix as covariance matrix and 
then the linear discriminant is obtained by orthogonal projection of y onto the 
vector difference of means. In terms of the original ellipsoid the discriminant 
is in the direction conjugate to the vector joining the means; see Figure 5.1. 

If the multivariate normal distributions have unequal covariance matrices the 
discriminant function becomes quadratic in y having the interpretation that an 
observation very distant from both vector means is more likely to come from 
the distribution with larger covariance matrix contribution in the direction of 
departure. An extreme case is when the two distributions have the same vector 
mean but different covariance matrices. 

Figure 5.1. (a) The ellipses are contours of constant density for two bivariate 
normal distributions with the same covariance matrix. The dashed lines show the 
direction conjugate to the line joining means. Projection of the observed random 
variable Y parallel to conjugate direction determines linear discriminant. (b) After 
linear transformation component variables are independent with the same variance, 
ellipses become circles and the discriminant corresponds to orthogonal projection 
onto the line joining centres. 


Notes 5 


Section 5.3. For the contradictions that may follow from manipulating fiducial 
probabilities like ordinary probabilities, see Lindley (1958). For the exchange 
paradox, see Pawitan (2000) and for a more detailed discussion Christensen 
and Utts (1992). 


Section 5.5. The discussion of rain in Gothenburg is an attempt to explain R. A. 
Fisher’s notion of probability. 


Section 5.6. Robinson (1979) discusses the difficulties with the notion of 
recognizable subsets. Kalbfleisch (1975) separates the two different reasons for 
conditioning. For a distinctive account of these issues, see Sundberg (2003). 


Section 5.7. For more references on the history of these ideas, see Appendix A. 
See Garthwaite et al. (2005) for a careful review of procedures for eliciting 
expert opinion in probabilistic form. See also Note 4.2. Walley (1991) gives a 
thorough account of an approach based on the notion that You can give only 
upper and lower bounds to Your probabilities. 
Section 5.8. Again see Appendix A for further references. Stein (1959) gives 
Example 5.5 as raising a difficulty with fiducial probability; his arguments are 
just as cogent in the present context. Mitchell (1967) discusses Example 5.7. 


Section 5.9. Bernardo (1979) introduces reference priors and gives (Bernardo, 
2005) a careful account of their definition, calculation and interpretation. Jaynes 
(1976) discusses finite parameter spaces. Berger (2004) defends the impersonal 
approach from a personalistic perspective. Kass and Wasserman (1996) give a 
comprehensive review of formal rules for selecting prior distributions. Lindley 
(1956) develops the notion of information in a distribution and its relation to 
Shannon information and to entropy. 


Section 5.11. The argument in this section is variously regarded as wrong, as 
correct but irrelevant, and as correct and important to a degree of belief concept 
of probability. 


Section 5.12. Greenland (2005) gives a searching discussion of systematic 
errors in epidemiological studies, arguing from a Bayesian viewpoint. 


Section 5.13. Copas and Eguchi (2005) discuss the effects of local model uncer- 
tainty. A key issue in such discussions is the extent to which apparently similar 
parameters defined in different models retain their subject-matter interpretation. 


Section 5.15. Hochberg and Tamhane (1987) give a systematic account of mul- 
tiple comparison methods. For work emphasizing the notion of false discovery 
rate, see Storey (2002). 


Section 5.16. A treatment of sequential stopping by Wald (1947) strongly based 
on the use of the likelihood ratio between two base hypotheses led to very extens- 
ive development. It was motivated by applications to industrial inspection. More 
recent developments have focused on applications to clinical trials (Whitehead, 
1997). Anscombe (1949, 1953) puts the emphasis on estimation and shows that 
to a first approximation special methods of analysis are not needed, even from a 
frequentist viewpoint. See also Barndorff-Nielsen and Cox (1984). Vågerö and 
Sundberg (1999) discuss a data-dependent design, the up-and-down method of 
estimating a binary dose-response curve, in which it is the design rather than 
the stopping rule that distorts the likelihood-based estimation of the slope of 
the curve, tending to produce estimates of slope that are too steep. It is not 
clear whether this is a case of failure of standard asymptotics or of very slow 
convergence to the asymptotic behaviour. 
Section 5.17. The discrimination problem has been tackled in the literature 
from many points of view. The original derivation of the linear discriminant 
function (Fisher, 1936) is as that linear function of the data, normalized to have 
fixed variance, that maximized the difference between the means. As such it 
has probably been more used as a tool of analysis than for discrimination. Later 
Fisher (1940) notes that by scoring the two groups say 0 and 1 and making a 
least squares analysis of that score on the measured variables as fixed, not only 
was the linear discriminant function recovered as the regression equation, but 
the significance of coefficients and groups of coefficients could be tested by the 
standard normal-theory tests as if the (0, 1) score had a normal distribution and 
the measured variables were fixed. 

To interpret Figure 5.1, in either of the defining ellipses draw chords parallel 
to the line joining the two means. The locus of midpoints of these chords is a 
line giving the conjugate direction. 

The order in probability notation $O_p(1/\sqrt{n})$ is used extensively in Chapter 6; 
see Note 6.2. 


6 
Asymptotic theory 


Summary. Approximate forms of inference based on local approximations of 
the log likelihood in the neighbourhood of its maximum are discussed. An ini- 
tial discussion of the exact properties of log likelihood derivatives includes a 
definition of Fisher information. Then the main properties of maximum like- 
lihood estimates and related procedures are developed for a one-dimensional 
parameter. A notation is used to allow fairly direct generalization to vector para- 
meters and to situations with nuisance parameters. Finally numerical methods 
and some other issues are discussed in outline. 


6.1 General remarks 


The previous discussion yields formally exact frequentist solutions to a number 
of important problems, in particular concerning the normal-theory linear model 
and various problems to do with Poisson, binomial, exponential and other expo- 
nential family distributions. Of course the solutions are formal in the sense that 
they presuppose a specification which is at best a good approximation and which 
may in fact be inadequate. Bayesian solutions are in principle always available, 
once the full specification of the model and prior distribution are established. 
There remain, however, many situations for which the exact frequentist 
development does not work; these include nonstandard questions about simple 
situations and many models where more complicated formulations are unavoid- 
able for any sort of realism. These issues are addressed by asymptotic analysis. 
That is, approximations are derived on the basis that the amount of inform- 
ation is large, errors of estimation are small, nonlinear relations are locally 
linear and a central limit effect operates to induce approximate normality of 
log likelihood derivatives. The corresponding Bayesian results are concerned 
with finding approximations to the integrals involved in calculating a posterior 
density. 


6.2 Scalar parameter 


6.2.1 Exact results 


It is simplest to begin with a one-dimensional unknown parameter $\theta$ and log 
likelihood $l(\theta; Y)$ considered as a function of the vector random variable Y, 
with gradient 

$$ U = U(\theta) = U(\theta; Y) = \partial l(\theta; Y)/\partial\theta. \tag{6.1} $$

This will be called the score or more fully the score function. Now, provided 
the normalizing condition 

$$ \int f_Y(y; \theta)\,dy = 1 \tag{6.2} $$

can be differentiated under the integral sign with respect to $\theta$, we have that 

$$ \int U(\theta; y) f_Y(y; \theta)\,dy = 0. \tag{6.3} $$

That is, 

$$ E\{U(\theta; Y); \theta\} = 0, \tag{6.4} $$

where the expectation and differentiation take place at the same value of $\theta$. If 
it is legitimate to differentiate also (6.3) under the integral sign it follows that 

$$ \int \{\partial U(\theta; y)/\partial\theta + U^2(\theta; y)\} f_Y(y; \theta)\,dy = 0. \tag{6.5} $$

Therefore if we define 

$$ i(\theta) = E\{-\partial^2 l(\theta; Y)/\partial\theta^2; \theta\}, \tag{6.6} $$

then 

$$ \mathrm{var}\{U(\theta; Y); \theta\} = i(\theta). \tag{6.7} $$

The function $i(\theta)$ is called the expected information or sometimes the Fisher 
information about $\theta$. Note that it and the gradient U are calculated from the 
vector Y representing the full vector of observed random variables. Essentially 
$i(\theta)$ measures the expected curvature of the log likelihood and the greater the 
curvature the sharper the inference. 
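
The identities (6.4) and (6.7) can be verified exactly in a simple discrete case. The sketch below is an illustration under an assumed model, a single Poisson observation parameterized by its mean `mu`; the function names are hypothetical and the infinite sums are truncated at `y_max`, where the neglected probability is negligible for moderate `mu`.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability, computed on the log scale for numerical stability."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def score_moments(mu, y_max=400):
    """Return (E U, var U, E(-d2 l / d mu2)) for one Poisson(mu) observation."""
    mean_u = 0.0
    mean_u_sq = 0.0
    info = 0.0
    for y in range(y_max + 1):
        p = poisson_pmf(y, mu)
        u = y / mu - 1.0            # score: derivative of y log(mu) - mu
        mean_u += p * u
        mean_u_sq += p * u * u
        info += p * y / mu ** 2     # minus the second derivative is y / mu^2
    return mean_u, mean_u_sq - mean_u ** 2, info

mean_u, var_u, info = score_moments(3.5)
print(mean_u, var_u, info, 1.0 / 3.5)
```

For this model the score has zero mean and both the variance of the score and the expected negative second derivative equal 1/mu, in line with (6.4), (6.6) and (6.7).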
Example 6.1. Location model (ctd). If Y is a single random variable from the 
density $g(y - \theta)$, the score is 

$$ \frac{dg(y - \theta)/d\theta}{g(y - \theta)} \tag{6.8} $$

and the information is 

$$ \int [\{dg(y)/dy\}^2/g(y)]\,dy, \tag{6.9} $$

sometimes called the intrinsic accuracy of the family g(.). That the 
information is independent of $\theta$ is one facet of the variation inherent in the model 
being independent of $\theta$. For a normal distribution of unknown mean and known 
variance $\sigma_0^2$ the intrinsic accuracy is $1/\sigma_0^2$. 


Example 6.2. Exponential family. Suppose that Y corresponds to a single 
observation from the one-parameter exponential family distribution with 
canonical parameter $\theta$, the distribution thus being written in the form 

$$ m(y)\exp\{\theta s - k(\theta)\}. \tag{6.10} $$

Then the score function is $s - dk(\theta)/d\theta = s - \eta$, where $\eta$ is the mean parameter, 
and the information is $d^2k(\theta)/d\theta^2$, which is equal to $\mathrm{var}(S; \theta)$. The information 
is independent of $\theta$ if and only if $k(\theta)$ is quadratic. The distribution is then 
normal, the only exponential family distribution which is a special case of 
a location model in the sense of Example 6.1. 


By an extension of the same argument, in which the new density $f_Y(y; \theta + \delta)$ 
is expanded in a series in $\delta$, we have 

$$ E\{U(\theta); \theta + \delta\} = i(\theta)\delta + O(\delta^2). \tag{6.11} $$


This shows two roles for the expected information. 
Discussion of the regularity conditions involved in these and subsequent 
arguments is deferred to Chapter 7. 


6.2.2 Parameter transformation 


In many applications the appropriate form for the parameter $\theta$ is clear from 
a subject-matter interpretation, for example as a mean or as a regression 
coefficient. From a more formal viewpoint, however, the whole specification 
is unchanged by an arbitrary smooth (1, 1) transformation from $\theta$ to, say, 
$\phi = \phi(\theta)$. When such transformations are under study, we denote the 
relevant score and information functions by respectively $U^{\Theta}(\theta; Y)$, $i^{\Theta}(\theta)$ and 
$U^{\Phi}(\phi; Y)$, $i^{\Phi}(\phi)$. 

Directly from the definition 

$$ U^{\Phi}\{\phi(\theta); Y\} = U^{\Theta}(\theta; Y)\,d\theta/d\phi \tag{6.12} $$

and, because the information is the variance of the score, 

$$ i^{\Phi}\{\phi(\theta)\} = i^{\Theta}(\theta)(d\theta/d\phi)^2. \tag{6.13} $$


Example 6.3. Transformation to near location form. If we choose the 
function $\phi(\theta)$ so that 

$$ d\phi(\theta)/d\theta = \sqrt{i^{\Theta}(\theta)}, \tag{6.14} $$

i.e., so that 

$$ \phi(\theta) = \int^{\theta} \sqrt{i^{\Theta}(x)}\,dx, \tag{6.15} $$

then the information is constant in the $\phi$-parameterization. In a sense, the nearest 
we can make the system to a pure location model is by taking this new 
parameterization. The argument is similar to but different from that sometimes used in 
applications, of transforming the observed random variable aiming to make its 
variance not depend on its mean. 

A prior density in which the parameter $\phi$ is uniform is called a Jeffreys prior. 


For one observation from the exponential family model of Example 6.2 the 
information about the canonical parameter $\theta$ is $i^{\Theta}(\theta) = k''(\theta)$. The information 
about the mean parameter $\eta = k'(\theta)$ is thus 

$$ i^{H}(\eta) = i^{\Theta}(\theta)/(d\eta/d\theta)^2 = 1/k''(\theta), \tag{6.16} $$

the answer being expressed as a function of $\eta$. This is also the reciprocal of the 
variance of the canonical statistic. 
In the special case of a Poisson distribution of mean $\mu$, written in the form 

$$ (1/y!)\exp(y\log\mu - \mu), \tag{6.17} $$

the canonical parameter is $\theta = \log\mu$, so that $k(\theta) = e^{\theta}$. Thus, on differentiating 
twice, the information about the canonical parameter is $e^{\theta} = \mu$, whereas the 
information about $\eta = \mu$ is $1/\mu$. Further, by the argument of Example 6.3, the 
information is constant on taking as new parameter 

$$ \phi(\theta) = \int^{\theta} e^{x/2}\,dx \tag{6.18} $$

and, omitting largely irrelevant constants, this suggests transformation to 
$\phi = e^{\theta/2}$, i.e., to $\phi = \sqrt{\mu}$. Roughly speaking, the transformation extends the 
part of the parameter space near the origin and contracts the region of very large 
values. 
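
The constancy of the information in the square-root parameterization follows from the transformation rule (6.13) and is easy to confirm numerically. The sketch below is illustrative; the function names are assumptions, and the expected information n/mu for n independent Poisson observations is taken from the discussion above.

```python
def info_mean(mu, n=1):
    """Expected information about the Poisson mean mu from n observations."""
    return n / mu

def info_sqrt_mean(phi, n=1):
    """Information about phi = sqrt(mu), using i_phi = i_mu * (d mu / d phi)^2."""
    mu = phi ** 2
    dmu_dphi = 2.0 * phi
    return info_mean(mu, n) * dmu_dphi ** 2

for phi in (0.5, 1.0, 3.0, 10.0):
    print(phi, info_sqrt_mean(phi))
```

Every value of phi gives information 4 per observation, the constant that the text omits as largely irrelevant.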
6.2.3 Asymptotic formulation 


The results so far involve no mathematical approximation. Suppose now that 
there is a quantity n which is frequently a sample size or other measure of the 
amount of data, but which more generally is an index of the precision of the 
resulting estimates, such as a reciprocal variance of a measurement procedure. 
The following mathematical device is now used to generate approximations. 

We denote the mean expected information per unit n, $i(\theta)/n$, by 
$\bar{\imath}(\theta)$ and assume that as $n \to \infty$ it has a nonzero limit. Further, we assume that 
$U(\theta)/\sqrt{n}$ converges in distribution to a normal distribution of zero mean and 
variance $\bar{\imath}(\theta)$. Note that these are in fact the exact mean and limiting variance. 

Results obtained by applying the limit laws of probability theory in this 
context are to be regarded as approximations hopefully applying to the data and 
model under analysis. The limiting operation $n \to \infty$ is a fiction used to derive 
these approximations whose adequacy in principle always needs consideration. 
There is never any real sense in which $n \to \infty$; the role of asymptotic analysis 
here is no different from that in any other area of applied mathematics. 

We take the maximum likelihood estimate to be defined by the equation 


$$ U(\hat\theta; Y) = 0, \tag{6.19} $$


taking the solution giving the largest log likelihood should there be more than 
one solution. Then the most delicate part of the following argument is to show 
that the maximum likelihood estimate is very likely to be close to the true value. 
We argue somewhat informally as follows. 

By assumption then, as $n \to \infty$, $i(\theta)/n \to \bar{\imath}(\theta) > 0$, and moreover $l(\theta)$ 
has the qualitative behaviour of the sum of n independent random variables, 
i.e., is $O_p(n)$ with fluctuations that are $O_p(\sqrt{n})$. Finally it is assumed that 
$U(\theta; Y)$ is asymptotically normal with zero mean and variance $i(\theta)$. We call 
these conditions standard n asymptotics. Note that they apply to time series 
and spatial models with short-range dependence. For processes with long-range 
dependence different powers of n would be required in the normalization. For 
a review of the $o$, $O$, $o_p$, $O_p$ notation, see Note 6.2. 

It is convenient from now on to denote the notional true value of the parameter 
by $\theta^*$, leaving $\theta$ as the argument of the log likelihood function. First, note that by 
Jensen’s inequality 

$$ E\{l(\theta; Y); \theta^*\} \tag{6.20} $$

has its maximum at $\theta = \theta^*$. That is, subject to smoothness, $E\{l(\theta; Y)\}$ as a 
function of $\theta$ has its maximum at the true value $\theta^*$, takes a value that is O(n) 
and has a second derivative at the maximum that also is O(n). Thus unless 
$|\theta - \theta^*| = O(1/\sqrt{n})$ the expected log likelihood is small compared with its 
maximum value, and the likelihood itself is exponentially small compared with 
its maximum value. 

Next note that the observed log likelihood will lie within $O_p(\sqrt{n})$ of its 
expectation and because of the smoothness of the observed log likelihood this 
statement applies to the whole function over a range within $O(1/\sqrt{n})$ of $\theta^*$, 
so that $\hat\theta$ lies within $O_p(1/\sqrt{n})$ of $\theta^*$. See Figure 6.1. 

Figure 6.1. Behaviour of expected log likelihood (solid line) near the true value 
within the band of width $O(\sqrt{n})$ around the expected value. Asymptotic behaviour 
near maximum as indicated. 

The argument does not exclude the occurrence of other local maxima to $l(\theta)$ 
but typically the log likelihood evaluated there will be very much less than $l(\hat\theta)$, 
the overall maximum. Of course, the possibility of multiple maxima may well 
be a considerable concern in implementing iterative schemes for finding $\hat\theta$. 

In the light of the above discussion, we expand $l(\theta)$ about $\theta^*$ to quadratic 
terms to obtain 

$$ l(\theta) - l(\theta^*) = U(\theta^*)(\theta - \theta^*) - (\theta - \theta^*)^2 j(\theta^*)/2, \tag{6.21} $$

where 

$$ j(\theta^*) = -\partial^2 l(\theta^*)/\partial\theta^2 \tag{6.22} $$

is called the observed information function at $\theta^*$ in the light of its connection 
with $i(\theta^*)$, the expected information. In the range of interest within $O_p(1/\sqrt{n})$ 
of $\theta^*$ the next term in the expansion will be negligible under mild restrictions 
on the third derivative of $l(\theta)$. 

It follows, on differentiating (6.21), then obtaining the corresponding 
expansion for $U(\theta)$, and finally because $U(\hat\theta) = 0$, that approximately 

$$ \hat\theta - \theta^* = j^{-1}(\theta^*)U(\theta^*), \tag{6.23} $$

which to achieve appropriate normalization is better written 

$$ (\hat\theta - \theta^*)\sqrt{n} = \{n j^{-1}(\theta^*)\}\{U(\theta^*)/\sqrt{n}\}. \tag{6.24} $$

Now $j(\theta^*)/n$ converges in probability to its expectation $i(\theta^*)/n$ which itself has 
been assumed to converge to $\bar{\imath}(\theta^*)$, the mean expected information per unit n. 

We appeal to the result that if the sequence of random variables $Z_n$ converges 
in distribution to a normal distribution of zero mean and variance $\sigma^2$ and if 
the sequence $A_n$ converges in probability to the nonzero constant a, then the 
sequence $A_n Z_n$ converges in distribution to a normal distribution of zero mean 
and variance $a^2\sigma^2$. That is, variation in the $A_n$ can be ignored. This can be 
proved by examining the distribution function of the product. 

Now, subject to the continuity of $i(\theta)$, the following ratios all tend in 
probability to 1 as $n \to \infty$: 

$$ i(\hat\theta)/i(\theta^*); \quad j(\theta^*)/i(\theta^*); \quad j(\hat\theta)/i(\theta^*). \tag{6.25} $$

This means that in the result for the asymptotic distribution of the maximum 
likelihood estimate the asymptotic variance can be written in many 
asymptotically equivalent forms. In some ways the most useful of these employs the third 
form involving what we now denote by $\hat\jmath$ and have called simply the observed 
information, namely 

$$ \hat\jmath = j(\hat\theta). \tag{6.26} $$

It follows from (6.24) and (6.25) that the maximum likelihood estimate is 
asymptotically normally distributed with mean $\theta^*$ and variance that can be 
written in a number of forms, in particular $\hat\jmath^{-1}$ or $i^{-1}(\theta^*)$. 
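
A small simulation gives a feel for how well the asymptotic variance describes the maximum likelihood estimate at a finite sample size. The sketch below makes assumptions that are not in the text: independent exponential observations with rate `theta`, for which the estimate is n divided by the sample total and the expected information is n/theta^2, together with arbitrary choices of sample size, number of replicates and seed.

```python
import random
import statistics

def simulate_mle(theta, n, reps, seed=0):
    """Maximum likelihood estimates of an exponential rate from `reps` samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        total = sum(rng.expovariate(theta) for _ in range(n))
        estimates.append(n / total)       # solves the score equation n/theta - total = 0
    return estimates

def standardized_variance(theta, n, reps=2000, seed=0):
    """Sample variance of the estimates multiplied by the expected information."""
    estimates = simulate_mle(theta, n, reps, seed)
    return statistics.variance(estimates) * n / theta ** 2

print(standardized_variance(2.0, 200))
```

If the asymptotic approximation were exact the printed ratio would be 1. At n = 200 it should be close to 1, subject to simulation error, and smaller values of `n` show larger departures.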

Suppose that we make a (1, 1) transformation of the parameter from $\theta$ to $\phi(\theta)$. 
The likelihood at any point is unchanged by the labelling of the parameters 
so that the maximum likelihood estimates correspond exactly, $\hat\phi = \phi(\hat\theta)$. 
Further, the information in the $\phi$-parameterization that gives the asymptotic 
variance of $\hat\phi$ is related to that in the $\theta$-parameterization by (6.13) so that the 
asymptotic variances are related by 

$$ \mathrm{var}(\hat\phi) = \mathrm{var}(\hat\theta)(d\phi/d\theta)^2. \tag{6.27} $$

This relation follows also directly by local linearization of $\phi(\theta)$, the so-called 
delta method. Note that also, by the local linearity, asymptotic normality applies 
to both $\hat\theta$ and $\hat\phi$, but that in any particular application one version might be 
appreciably closer to normality than another. 
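
The delta method relation (6.27) amounts to multiplying the variance by the squared derivative of the transformation. A minimal sketch follows, in which the derivative is approximated by a central difference; the function names, the step size `h` and the exponential-rate example with asymptotic variance theta^2/n are all assumptions made for illustration.

```python
import math

def numerical_derivative(phi, theta, h=1e-6):
    """Central-difference approximation to d phi / d theta."""
    return (phi(theta + h) - phi(theta - h)) / (2.0 * h)

def delta_method_variance(phi, theta_hat, var_theta_hat):
    """Asymptotic variance of phi(theta_hat) from (6.27)."""
    slope = numerical_derivative(phi, theta_hat)
    return var_theta_hat * slope ** 2

# Exponential rate estimated from n = 50 observations (illustrative values):
theta_hat, n = 2.5, 50
var_theta_hat = theta_hat ** 2 / n           # inverse expected information at theta_hat
print(delta_method_variance(math.log, theta_hat, var_theta_hat))
```

For the log transformation the result is 1/n whatever the value of the rate, so log is the information-stabilizing transformation for this model in the sense of Example 6.3.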

Thus we may test the null hypothesis $\theta = \theta_0$ by the test statistic 

$$ (\hat\theta - \theta_0)/\sqrt{i^{-1}(\theta_0)}, \tag{6.28} $$

or one of its equivalents, and confidence intervals can be found either from the set 
of values consistent with the data at some specified level or more conveniently 
from the pivot 

$$ (\theta - \hat\theta)\sqrt{\hat\jmath}. \tag{6.29} $$


6.2.4 Asymptotically equivalent statistics 


Particularly in preparation for the discussion with multidimensional parameters 
it is helpful to give some further discussion and comparison formulated 
primarily in terms of two-sided tests of a null hypothesis $\theta = \theta_0$ using quadratic test 
statistics. 

That is, instead of the normally distributed statistic (6.28) we use its square, 
written, somewhat eccentrically, as 

$$ W_E = (\hat\theta - \theta_0)\,i(\theta_0)\,(\hat\theta - \theta_0). \tag{6.30} $$

Because under the null hypothesis this is the square of a standardized normal 
random variable, $W_E$ has under that hypothesis a chi-squared distribution with 
one degree of freedom. 

In the form shown in Figure 6.2, the null hypothesis is tested via the 
squared horizontal distance between the maximum likelihood estimate and the 
null value, appropriately standardized. 

An alternative procedure is to examine the vertical distance, i.e., to see how 
much larger a log likelihood is achieved at the maximum than at the null 
hypothesis. We may do this via the likelihood ratio statistic 

$$ W_L = 2\{l(\hat\theta) - l(\theta_0)\}. \tag{6.31} $$

Appeal to the expansion (6.21) shows that $W_L$ and $W_E$ are equal to the first 
order of asymptotic theory, in fact that 

$$ W_E - W_L = O_p(1/\sqrt{n}). \tag{6.32} $$

Thus under the null hypothesis $W_L$ has a chi-squared distribution with one 
degree of freedom and by the relation between significance tests and confidence 
regions a confidence set for $\theta$ is given by taking all those values of $\theta$ for which 
$l(\theta)$ is sufficiently close to $l(\hat\theta)$. That is, 

$$ \{\theta : l(\hat\theta) - l(\theta) \le k^*_{1;c}/2\}, \tag{6.33} $$

where $k^*_{1;c}$ is the upper c point of the chi-squared distribution with one degree of 
freedom, forms a $1 - c$ level confidence region for $\theta$. See Figure 6.2. Note that 
unless the likelihood is symmetric around $\hat\theta$ the region will not be symmetric. 
To obtain a one-sided version of $W_L$ more directly comparable to $(\hat\theta - \theta_0)\sqrt{\hat\jmath}$ 
we may define the signed likelihood ratio statistic as 

$$ \mathrm{sgn}(\hat\theta - \theta_0)\sqrt{W_L}, \tag{6.34} $$

treating this as having approximately a standard normal distribution when 
$\theta = \theta_0$. 

Figure 6.2. Three asymptotically equivalent ways, all based on the log likelihood 
function, of testing the null hypothesis $\theta = \theta_0$: $W_E$, horizontal distance; $W_L$, vertical 
distance; $W_U$, slope at null point. 
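
A likelihood-based confidence set of the form (6.33) can be computed by solving for the two points at which the log likelihood has dropped by half the chi-squared point. The sketch below assumes n independent Poisson observations with positive total `total`, uses a simple bisection written for the purpose, and hard-codes the upper 5% point of the chi-squared distribution with one degree of freedom; all names are illustrative.

```python
import math

CHI2_1_UPPER_5PCT = 3.841458820694124   # upper 0.05 point, one degree of freedom

def log_lik(theta, total, n):
    """Poisson log likelihood for mean theta, omitting terms free of theta."""
    return total * math.log(theta) - n * theta

def bisect(func, lo, hi, tol=1e-10):
    """Root of func in [lo, hi], assuming func(lo) and func(hi) differ in sign."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def lr_interval(total, n, k=CHI2_1_UPPER_5PCT):
    """Return (lower, theta_hat, upper) for the set where l(theta_hat) - l(theta) <= k/2."""
    theta_hat = total / n

    def excess(theta):
        return log_lik(theta_hat, total, n) - log_lik(theta, total, n) - k / 2.0

    lower = bisect(excess, 1e-12, theta_hat)
    upper = bisect(excess, theta_hat, 10.0 * theta_hat + 10.0)
    return lower, theta_hat, upper

lower, theta_hat, upper = lr_interval(total=30, n=10)
print(lower, theta_hat, upper)
```

The upper limit lies further from the estimate than the lower limit does, illustrating the remark that the region is not symmetric unless the likelihood is.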

There is a third possibility, sometimes useful in more complicated problems, 
namely to use the gradient of the log likelihood at $\theta_0$; if this is too large it 
suggests that the data can be explained much better by changing $\theta$. See again 
Figure 6.2. 

To implement this idea we use the statistic 

$$ T_U = U(\theta_0)/\sqrt{i(\theta_0)}, \tag{6.35} $$

i.e., the slope divided by its standard error. Note that for direct comparability 
with the other two one-sided statistics a change of sign is needed. 
In the quadratic version we take 

$$ W_U = U(\theta_0; Y)\,i^{-1}(\theta_0)\,U(\theta_0; Y). \tag{6.36} $$

It follows directly from the previous discussion that this is to the first order 
equal to both $W_E$ and $W_L$. 

Note that $W_E$ and $W_L$ would be identical were the log likelihood exactly 
quadratic. Then they would be identical also to $W_U$ were observed rather than 
expected information used in the definitions. Under the standard conditions for 
asymptotic theory this quadratic behaviour is approached locally and asymp- 
totically. In applications the three statistics are very often essentially equal for 
practical purposes; any serious discrepancy between them indicates that the 
standard theory is inadequate in at least some respects. We leave to Section 6.6 
the discussion of what to do then and the broader issue of the relative advantages 
of the forms we have given here. 
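
To see how close the three quadratic statistics are in a concrete case, the sketch below evaluates them for an assumed model: n independent exponential observations with rate theta and sum `total`, so that the log likelihood is n log(theta) minus theta times `total`, the score is n/theta minus `total`, and the expected information is n/theta^2. The function name and the numerical values are illustrative.

```python
import math

def exponential_test_statistics(total, n, theta0):
    """Return (W_E, W_L, W_U) for testing theta = theta0 with exponential data."""
    theta_hat = n / total                      # maximum likelihood estimate of the rate

    def log_lik(theta):
        return n * math.log(theta) - theta * total

    score0 = n / theta0 - total                # U(theta0)
    info0 = n / theta0 ** 2                    # expected information i(theta0)
    w_e = (theta_hat - theta0) ** 2 * info0                  # (6.30)
    w_l = 2.0 * (log_lik(theta_hat) - log_lik(theta0))       # (6.31)
    w_u = score0 ** 2 / info0                                # (6.36)
    return w_e, w_l, w_u

# n = 400 observations whose estimate is 5% above the null value 1 (assumed data).
print(exponential_test_statistics(total=400 / 1.05, n=400, theta0=1.0))
```

The three values differ only in the second decimal place here. Repeating the calculation with a small `n` and an estimate far from `theta0` produces the kind of serious discrepancy that the text treats as a warning sign.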


6.2.5 Optimality considerations 


One approach to the discussion of the optimality of these procedures is by 
detailed analysis of the properties of the resulting estimates and tests. A more 
direct argument is as follows. In the previous discussion we expanded the log 
likelihood around the true value. We now expand instead around the maximum 
likelihood estimate $\hat\theta$ to give for $\theta$ sufficiently close to $\hat\theta$ that 

$$ l(\theta) = l(\hat\theta) - (\theta - \hat\theta)^2\hat\jmath/2 + O(1/\sqrt{n}), \tag{6.37} $$

or, on exponentiating, that the likelihood has the form 

$$ m(y)\exp\{-(\theta - \hat\theta)^2\hat\jmath/2\}\{1 + O(1/\sqrt{n})\}, \tag{6.38} $$

where m(y) is a normalizing function. This gives an approximate version of 
the Fisherian reduction and shows that to the first order of asymptotic theory the 
data can be divided into $(\hat\theta, \hat\jmath)$, which summarize what can be learned given the 
model, and the data conditionally on $(\hat\theta, \hat\jmath)$, which provide information about 
model adequacy. That is, the maximum likelihood estimate and the observed 
information are approximately sufficient statistics. Note that $\hat\theta$ could be replaced 
by any estimate differing from it by $O_p(n^{-1})$ and there are many such! 
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6.2.6 Bayesian analysis 


In principle once the Bayesian formulation has been established, the calculation 
of the posterior distribution involves only integration and so the role of asymp- 
totic theory, at least in one dimension, is restricted to establishing the general 
form of the answers to be expected and often to simplifying the calculation of 
the normalizing constant in the posterior distribution. 

We distinguish two cases. First, suppose that the prior density, to be denoted
here by $p(\theta)$, is continuous and nonzero in the neighbourhood of the true
value $\theta^*$, and, for simplicity, bounded elsewhere. The posterior density of $\Theta$
is then


$$ p(\theta) \exp\{l(\theta)\} \Big/ \int p(\phi) \exp\{l(\phi)\}\, d\phi. \tag{6.39} $$


Now expand $l(\phi)$ about $\hat\theta$ as in (6.21). Only values of $\theta$ and $\phi$ near $\hat\theta$ make
an effective contribution, and the numerator is approximately


$$ p(\hat\theta) \exp\{l(\hat\theta)\} \exp\{-(\theta - \hat\theta)^2 \hat{\jmath}/2\}, \tag{6.40} $$


whereas the denominator is, after some calculation, to the same order of
approximation


$$ (2\pi)^{1/2} p(\hat\theta) \exp\{l(\hat\theta)\}/\sqrt{\hat{\jmath}}. \tag{6.41} $$


That is, the posterior distribution is asymptotically normal with mean $\hat\theta$ and
variance $\hat{\jmath}^{-1}$. The approximation technique used here is called Laplace expansion.
The asymptotic argument considered amounts to supposing that the prior
varies little over the range of values of $\theta$ reasonably consistent with the data
as judged by the likelihood function. As with all other asymptotic considera- 
tions, the extent to which the assumptions about the prior and the likelihood are 
justified needs examination in each application. 
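
The quality of the Laplace approximation (6.41) to the normalizing constant can be checked numerically in a case where the integral is available exactly. The following sketch rests on assumptions made only for the illustration: an exponential likelihood $\theta^n e^{-\theta s}$ with $s = \Sigma y_k$, and the prior density $p(\theta) = e^{-\theta}$, for which the exact integral is $\Gamma(n+1)/(s+1)^{n+1}$. The function names are hypothetical.

```python
import math

def log_normalizer_exact(n, s):
    """log of the integral over theta > 0 of exp(-theta) * theta**n * exp(-theta * s)."""
    return math.lgamma(n + 1) - (n + 1) * math.log(s + 1.0)

def log_normalizer_laplace(n, s):
    """log of the approximation (2 pi)^{1/2} p(theta_hat) exp{l(theta_hat)} / sqrt(j_hat)."""
    theta_hat = n / s                        # maximizes l(theta) = n log(theta) - theta * s
    j_hat = n / theta_hat ** 2               # observed information at theta_hat
    log_prior = -theta_hat                   # log p(theta_hat) for the assumed prior
    loglik = n * math.log(theta_hat) - theta_hat * s
    return 0.5 * math.log(2.0 * math.pi) + log_prior + loglik - 0.5 * math.log(j_hat)
```

With $n = s = 100$ the two logarithms differ by about 0.004 and with $n = s = 10$ by about 0.04, in line with an error that decreases as the amount of data increases.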

The caution implicit in the last statement is particularly needed in the second
case that we now consider where there is an atom of prior probability $\pi_0$ at a null
hypothesis $\theta = \theta_0$ and the remaining prior probability is distributed over the
nonnull values. It is tempting to write this latter part in the form $(1 - \pi_0) p_0(\theta)$,
where $p_0(\theta)$ is some smooth density not depending on $n$. This is, however, often
to invite misinterpretation, because in most, if not all, specific applications in
which a test of such a hypothesis is thought worth doing, the only serious
possibilities needing consideration are that either the null hypothesis is (very
nearly) true or that some alternative within a range fairly close to $\theta_0$ is true. This
suggests that the remaining part of the prior density should usually be taken
in the form $q\{(\theta - \theta_0)\sqrt{n}\}\sqrt{n}$, where $q(.)$ is some fixed probability density
function. This is of scale and location form centred on $\theta_0$ and with a dispersion
of order $1/\sqrt{n}$. As $n \to \infty$ this concentrates its mass in a local region around
$\theta_0$. It then follows that the atom of posterior probability at $\theta_0$ is such that the
odds against $\theta_0$, namely $\{1 - p(\theta_0 \mid y)\}/p(\theta_0 \mid y)$, are


$$ \frac{(1 - \pi_0) \int \exp\{l(\theta)\}\, q\{(\theta - \theta_0)\sqrt{n}\}\sqrt{n}\, d\theta}{\pi_0 \exp\{l(\theta_0)\}}. \tag{6.42} $$

Now expand the log likelihood around $\theta_0$, replacing $j(\theta_0)$ by $n\bar\imath$ and the signed
gradient statistic by $T_U = U(\theta_0; Y)/\sqrt{\{n\bar\imath(\theta_0)\}}$. The posterior odds against $\theta_0$
are then approximately


$$ (1 - \pi_0) \int \exp(v T_U - v^2/2)\, q(v/\sqrt{\bar\imath})/\sqrt{\bar\imath}\, dv \Big/ \pi_0. \tag{6.43} $$


Thus as $n \to \infty$ the posterior odds are asymptotically a fixed function of the
test statistic $T_U$, or indeed of one of the other essentially equivalent statistics
studied above. That is, for a fixed $q(.)$ the relation between the significance level
and the posterior odds is independent of $n$. This is by contrast with the theory
with a fixed prior in which case the corresponding answer depends explicitly
on $n$ because, typically unrealistically, large portions of prior probability are in
regions remote from the null hypothesis relative to the information in the data.
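
To see how the posterior odds (6.43) depend on the test statistic, the integral can be evaluated numerically for a particular $q(.)$. In the sketch below $q$ is taken, purely as an assumption for the illustration, to be the standard normal density, for which the integral also has the closed form $(1 + \bar\imath)^{-1/2}\exp[T_U^2\bar\imath/\{2(1 + \bar\imath)\}]$. The function names and default arguments are hypothetical.

```python
import math

def posterior_odds_against_null(t_u, pi0=0.5, ibar=1.0, half_width=12.0, steps=24000):
    """Trapezoidal evaluation of (6.43), assuming q is the standard normal density."""
    def q(x):
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    def integrand(v):
        return math.exp(v * t_u - 0.5 * v * v) * q(v / math.sqrt(ibar)) / math.sqrt(ibar)

    h = 2.0 * half_width / steps
    total = 0.5 * (integrand(-half_width) + integrand(half_width))
    for k in range(1, steps):
        total += integrand(-half_width + k * h)
    return (1.0 - pi0) * total * h / pi0

def posterior_odds_closed_form(t_u, pi0=0.5, ibar=1.0):
    """Closed form of the same integral; available only because q is assumed normal."""
    factor = math.exp(0.5 * t_u ** 2 * ibar / (1.0 + ibar)) / math.sqrt(1.0 + ibar)
    return (1.0 - pi0) / pi0 * factor
```

Under these assumptions, with $\pi_0 = 1/2$ and $\bar\imath = 1$, a value $T_U = 1.96$ corresponds to posterior odds against the null hypothesis of only about 1.85, whatever the value of $n$.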


6.3 Multidimensional parameter 


The extension of the results of the previous section when the parameter $\theta$ is
$d$-dimensional is immediate. The gradient $U$ is replaced by the $d \times 1$ gradient
vector


$$ U(\theta; Y) = \nabla l(\theta; Y), \tag{6.44} $$


where $\nabla = (\partial/\partial\theta_1, \ldots, \partial/\partial\theta_d)^T$ forms a column vector of partial derivatives.
Arguing component by component, we have as before that


$$ E\{U(\theta; Y); \theta\} = 0. \tag{6.45} $$


Then, differentiating the $r$th component of this equation with respect to the
component $\theta_s$ of $\theta$, we have in generalization of (6.7) that the covariance matrix
of $U(\theta; Y)$ is


$$ \mathrm{cov}\{U(\theta; Y); \theta\} = E\{U(\theta; Y) U^T(\theta; Y)\} = i(\theta), \tag{6.46} $$


where the $d \times d$ matrix $i(\theta)$, called the expected information matrix, is the
matrix of second derivatives


$$ i(\theta) = E\{-\nabla\nabla^T l(\theta; Y); \theta\}. \tag{6.47} $$


Because this is the covariance matrix of a random vector it is positive definite
unless there are linear constraints among the components of the gradient vector
and we assume these absent.

Further, 


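$$ E\{U(\theta; Y); \theta + \delta\} = i(\theta)\delta + O(\|\delta\|^2). \tag{6.48} $$


Suppose that we transform from $\theta$ by a (1, 1) transformation to new parameters
$\phi(\theta)$, the transformation having Jacobian $\partial\phi/\partial\theta$ in which the rows
correspond to the components of $\phi$ and the columns to those of $\theta$. Adopting the
notation used in the scalar parameter case, we have that the score vectors are
related by


$$ U^\Phi(\phi; Y) = (\partial\theta/\partial\phi)^T U^\Theta(\theta; Y) \tag{6.49} $$


and the information matrices by


$$ i^\Phi(\phi) = \left(\frac{\partial\theta}{\partial\phi}\right)^T i^\Theta(\theta)\, \frac{\partial\theta}{\partial\phi}. \tag{6.50} $$

The inverse information matrices satisfy

$$ \{i^\Phi(\phi)\}^{-1} = \frac{\partial\phi}{\partial\theta}\, \{i^\Theta(\theta)\}^{-1} \left(\frac{\partial\phi}{\partial\theta}\right)^T. \tag{6.51} $$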


These are the direct extensions of the results of a 6.3. 

In terms of asymptotics, we suppose in particular that $U(\theta; Y)$ has an asymptotic
multivariate normal distribution and that $i(\theta)/n$ tends to a positive definite
matrix $\bar\imath(\theta)$. The univariate expansions are replaced by their multivariate forms,
for example expanding about the true value $\theta^*$, so that we can write $l(\theta; Y)$ in
the form


$$ l(\theta^*; Y) + (\theta - \theta^*)^T U(\theta^*; Y) - (\theta - \theta^*)^T j(\theta^*)(\theta - \theta^*)/2, \tag{6.52} $$


where $j(\theta^*)$ is minus the matrix of observed second derivatives.

Essentially the same argument as before now shows that $\hat\theta$ is asymptotically
normal with mean $\theta^*$ and covariance matrix $i^{-1}(\theta^*)$, which can be replaced by
$j^{-1}(\theta^*)$ or more commonly by $\hat{\jmath}^{-1}$, the observed inverse information matrix at
the maximum likelihood point.

Now if $Z$ is normally distributed with zero mean and variance $v$, then
$Z v^{-1} Z = Z^2/v$ has, by definition, a chi-squared distribution with one degree of freedom.
More generally if $Z$ is a $d \times 1$ vector having a multivariate normal distribution
with zero mean and covariance matrix $V$ of full rank $d$, then $\|Z\|_V^2 = Z^T V^{-1} Z$
has a chi-squared distribution with $d$ degrees of freedom. To see this, note
that the quadratic form is invariant under nonsingular transformations of $Z$
with corresponding changes in the covariance matrix and that $Z$ can always be
transformed to $d$ independent standardized normal random variables. See also
Note 2.6.

It follows that the pivots 


$$ (\theta - \hat\theta)^T i(\theta)(\theta - \hat\theta) \tag{6.53} $$

or

$$ (\theta - \hat\theta)^T \hat{\jmath}\,(\theta - \hat\theta) \tag{6.54} $$


can be used to form approximate confidence regions for $\theta$. In particular, the
second, and more convenient, form produces a series of concentric similar
ellipsoidal regions corresponding to different confidence levels.

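The three quadratic statistics discussed in Section 6.2.4 take the forms,
respectively


$$ W_E = (\hat\theta - \theta_0)^T i(\theta_0)(\hat\theta - \theta_0), \tag{6.55} $$

$$ W_L = 2\{l(\hat\theta) - l(\theta_0)\}, \tag{6.56} $$

$$ W_U = U(\theta_0; Y)^T i^{-1}(\theta_0)\, U(\theta_0; Y). \tag{6.57} $$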


Again we defer discussion of the relative merits of these until Sections 6.6 
and 6.11. 


6.4 Nuisance parameters 


6.4.1 The information matrix 


In the great majority of situations with a multidimensional parameter $\theta$, we need
to write $\theta = (\psi, \lambda)$, where $\psi$ is the parameter of interest and $\lambda$ the nuisance
parameter. Correspondingly we partition $U(\theta; Y)$ into two components $U_\psi(\theta; Y)$
and $U_\lambda(\theta; Y)$. Similarly we partition the information matrix and its inverse in
the form


$$ i(\theta) = \begin{pmatrix} i_{\psi\psi} & i_{\psi\lambda} \\ i_{\lambda\psi} & i_{\lambda\lambda} \end{pmatrix}, \tag{6.58} $$

and

$$ i^{-1}(\theta) = \begin{pmatrix} i^{\psi\psi} & i^{\psi\lambda} \\ i^{\lambda\psi} & i^{\lambda\lambda} \end{pmatrix}. \tag{6.59} $$


There are corresponding partitions for the observed information $j$.


6.4.2 Main distributional results


A direct approach for inference about $\psi$ is based on the maximum likelihood
estimate $\hat\psi$ which is asymptotically normal with mean $\psi$ and covariance
matrix $i^{\psi\psi}$ or equivalently $j^{\psi\psi}$. In terms of a quadratic statistic, we have for
testing whether $\psi = \psi_0$ the form


$$ W_E = (\hat\psi - \psi_0)^T (i^{\psi\psi})^{-1} (\hat\psi - \psi_0) \tag{6.60} $$


with the possibility of using $(j^{\psi\psi})^{-1}$ rather than $(i^{\psi\psi})^{-1}$. Note also that even
if $\psi_0$ were used in the calculation of $i^{\psi\psi}$ it would still be necessary to estimate $\lambda$
except for those special problems in which the information does not depend on $\lambda$.

Now to study $\psi$ via the gradient vector, as far as possible separated from $\lambda$,
it turns out to be helpful to write $U_\psi$ as a linear combination of $U_\lambda$ plus a
term uncorrelated with $U_\lambda$, i.e., as a linear least squares regression plus an
uncorrelated deviation. This representation is


$$ U_\psi = i_{\psi\lambda} i_{\lambda\lambda}^{-1} U_\lambda + U_{\psi\cdot\lambda}, \tag{6.61} $$


say, where $U_{\psi\cdot\lambda}$ denotes the deviation of $U_\psi$ from its linear regression on $U_\lambda$.
Then a direct calculation shows that


$$ \mathrm{cov}(U_{\psi\cdot\lambda}) = i_{\psi\psi\cdot\lambda}, \tag{6.62} $$


where

$$ i_{\psi\psi\cdot\lambda} = i_{\psi\psi} - i_{\psi\lambda} i_{\lambda\lambda}^{-1} i_{\lambda\psi} = (i^{\psi\psi})^{-1}. \tag{6.63} $$


The second form follows from a general expression for the inverse of a 
partitioned matrix. 
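
The identity between the two forms in (6.63) is easily verified numerically when both $\psi$ and $\lambda$ are scalars, since the partitioned inverse is then that of a $2 \times 2$ matrix. The following sketch uses hypothetical function names and is restricted to that scalar case.

```python
def adjusted_information(i_pp, i_pl, i_ll):
    """i_{psi psi . lambda} = i_pp - i_pl * i_ll^{-1} * i_lp for scalar psi and lambda."""
    return i_pp - i_pl * i_pl / i_ll

def inverse_block_pp(i_pp, i_pl, i_ll):
    """The (psi, psi) element of the inverse of [[i_pp, i_pl], [i_pl, i_ll]]."""
    det = i_pp * i_ll - i_pl ** 2
    return i_ll / det
```

For instance, with $i_{\psi\psi} = 4$, $i_{\psi\lambda} = 1$ and $i_{\lambda\lambda} = 2$ the first function gives 3.5 and the second gives $2/7$, the reciprocal of 3.5.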

A further property of the adjusted gradient which follows by direct evaluation
of the resulting matrix products by (6.63) is that


$$ E(U_{\psi\cdot\lambda}; \theta + \delta) = i_{\psi\psi\cdot\lambda}\delta_\psi + O(\|\delta\|^2), \tag{6.64} $$


i.e., to the first order the adjusted gradient does not depend on $\lambda$. This has the
important consequence that in using the gradient-based statistic to test a null
hypothesis $\psi = \psi_0$, namely


$$ W_U = U_{\psi\cdot\lambda}^T(\psi_0, \lambda)\, i^{\psi\psi}(\psi_0, \lambda)\, U_{\psi\cdot\lambda}(\psi_0, \lambda), \tag{6.65} $$


it is enough to replace $\lambda$ by, for example, its maximum likelihood estimate
given $\psi_0$, or even by inefficient estimates.

The second version of the quadratic statistic (6.56), corresponding more
directly to the likelihood function, requires the collapsing of the log likelihood
into a function of $\psi$ alone, i.e., the elimination of dependence on $\lambda$.
This might be achieved by a semi-Bayesian argument in which $\lambda$ but not $\psi$
is assigned a prior distribution but, in the spirit of the present discussion, it is
done by maximization. For given $\psi$ we define $\hat\lambda_\psi$ to be the maximum likelihood
estimate of $\lambda$ and then define the profile log likelihood of $\psi$ to be


$$ l_P(\psi) = l(\psi, \hat\lambda_\psi), \tag{6.66} $$


a function of $\psi$ alone and, of course, of the data. The analogue of the previous
likelihood ratio statistic for testing $\psi = \psi_0$ is now


$$ W_L = 2\{l_P(\hat\psi) - l_P(\psi_0)\}. \tag{6.67} $$

Expansions of the log likelihood about the point $(\psi_0, \lambda)$ show that in the
asymptotic expansion, we have to the first term that $W_L = W_U$ and therefore
that $W_L$ has a limiting chi-squared distribution with $d_\psi$ degrees of freedom
when $\psi = \psi_0$. Further because of the relation between significance tests and
confidence regions, the set of values of $\psi$ defined as


$$ \{\psi : 2\{l_P(\hat\psi) - l_P(\psi)\} \le k^*_{d_\psi;\, c}\} \tag{6.68} $$


forms an approximate $1 - c$ level confidence set for $\psi$.
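
A minimal sketch of the construction (6.68) follows, under assumptions chosen only for illustration: the observations are independent and normally distributed, the parameter of interest is the mean $\mu$ and the nuisance parameter is the variance, so that for given $\mu$ the maximum likelihood estimate of the variance is $\Sigma(y_k - \mu)^2/n$ and the profile log likelihood is, apart from a constant, $-(n/2)\log\{\Sigma(y_k - \mu)^2/n\} - n/2$. The confidence set is found by a crude search over a grid supplied by the user; the function names and the grid are hypothetical.

```python
import math

CHI2_1_95 = 3.841459  # upper 5% point of chi-squared with 1 degree of freedom

def profile_loglik(mu, y):
    """Profile log likelihood of the mean, the variance having been maximized out."""
    n = len(y)
    sigma2_hat_mu = sum((v - mu) ** 2 for v in y) / n   # MLE of variance for fixed mu
    return -0.5 * n * math.log(sigma2_hat_mu) - 0.5 * n

def profile_confidence_set(y, grid, k=CHI2_1_95):
    """Grid values of mu with 2{l_P(mu_hat) - l_P(mu)} <= k."""
    mu_hat = sum(y) / len(y)
    lp_max = profile_loglik(mu_hat, y)
    return [mu for mu in grid if 2.0 * (lp_max - profile_loglik(mu, y)) <= k]
```

For the data (1, 2, 3, 4, 5) and a grid of step 0.01 the retained values run from 1.48 to 4.52. With so few observations the chi-squared approximation is of course rough, and the interval is noticeably shorter than the one based on the Student t distribution.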


6.4.3 More on profile likelihood 


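The possibility of obtaining tests and confidence sets from the profile log likelihood
$l_P(\psi)$ stems from the relation between the curvature of $l_P(\psi)$ at its
maximum and the corresponding properties of the initial log likelihood $l(\psi, \lambda)$.

To see this relation, let $\nabla_\psi$ and $\nabla_\lambda$ denote the $d_\psi \times 1$ and $d_\lambda \times 1$ operations
of partial differentiation with respect to $\psi$ and $\lambda$ respectively and let $D_\psi$ denote
total differentiation of any function of $\psi$ and $\hat\lambda_\psi$ with respect to $\psi$. Then, by
the definition of total differentiation,


$$ D_\psi^T l_P(\psi) = \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \nabla_\lambda^T l(\psi, \hat\lambda_\psi) \{D_\psi \hat\lambda_\psi^T\}^T. \tag{6.69} $$


Now apply $D_\psi$ again to get the Hessian matrix of the profile likelihood in
the form


$$ \begin{aligned}
D_\psi D_\psi^T l_P(\psi) = {} & \nabla_\psi \nabla_\psi^T l(\psi, \hat\lambda_\psi) + \{\nabla_\psi \nabla_\lambda^T l(\psi, \hat\lambda_\psi)\} (D_\psi \hat\lambda_\psi^T)^T \\
& + \{\nabla_\lambda^T l(\psi, \hat\lambda_\psi)\} \{D_\psi D_\psi^T \hat\lambda_\psi\} + (D_\psi \hat\lambda_\psi^T) \nabla_\lambda \nabla_\psi^T l(\psi, \hat\lambda_\psi) \\
& + (D_\psi \hat\lambda_\psi^T) \{\nabla_\lambda \nabla_\lambda^T l(\psi, \hat\lambda_\psi)\} (D_\psi \hat\lambda_\psi^T)^T.
\end{aligned} \tag{6.70} $$


The maximum likelihood estimate $\hat\lambda_\psi$ satisfies for all $\psi$ the equation
$\nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0$. Differentiate totally with respect to $\psi$ to give


$$ \nabla_\psi \nabla_\lambda^T l(\psi, \hat\lambda_\psi) + (D_\psi \hat\lambda_\psi^T) \nabla_\lambda \nabla_\lambda^T l(\psi, \hat\lambda_\psi) = 0. \tag{6.71} $$

Thus three of the terms in (6.70) are equal except for sign and the third term is
zero in the light of the definition of $\hat\lambda_\psi$. Thus, eliminating $D_\psi \hat\lambda_\psi$, we have that
the formal observed information matrix calculated as minus the Hessian matrix
of $l_P(\psi)$ evaluated at $\hat\psi$ is


$$ \hat{\jmath}_{P,\psi\psi} = \hat{\jmath}_{\psi\psi} - \hat{\jmath}_{\psi\lambda} \hat{\jmath}_{\lambda\lambda}^{-1} \hat{\jmath}_{\lambda\psi} = \hat{\jmath}_{\psi\psi\cdot\lambda}, \tag{6.72} $$

where the two expressions on the right-hand side of (6.72) are calculated from
$l(\psi, \lambda)$. Thus the information matrix for $\psi$ evaluated from the profile likelihood
is the same as that evaluated via the full information matrix of all parameters.


This argument takes an especially simple form when both $\psi$ and $\lambda$ are scalar
parameters.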
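
The equality of the curvature of the profile log likelihood and the adjusted observed information can be checked numerically in a scalar case. The sketch below assumes, for illustration only, a linear regression $y_k = \lambda + \psi x_k + \epsilon_k$ with independent standard normal errors, so that $j_{\psi\psi} = \Sigma x_k^2$, $j_{\psi\lambda} = \Sigma x_k$ and $j_{\lambda\lambda} = n$. The second derivative of the profile log likelihood is obtained by central differences; all names are hypothetical.

```python
def profile_loglik(psi, x, y):
    """Profile log likelihood of the slope psi, the intercept having been maximized out."""
    n = len(x)
    lam_hat = sum(yk - psi * xk for xk, yk in zip(x, y)) / n   # MLE of intercept for fixed psi
    return -0.5 * sum((yk - lam_hat - psi * xk) ** 2 for xk, yk in zip(x, y))

def profile_information(psi, x, y, h=1e-3):
    """Minus the second derivative of the profile log likelihood, by central differences."""
    def lp(p):
        return profile_loglik(p, x, y)
    return -(lp(psi + h) - 2.0 * lp(psi) + lp(psi - h)) / h ** 2

def adjusted_observed_information(x):
    """j_pp - j_pl * j_ll^{-1} * j_lp computed from the full 2 x 2 observed information."""
    n = len(x)
    j_pp, j_pl, j_ll = sum(v * v for v in x), sum(x), float(n)
    return j_pp - j_pl * j_pl / j_ll
```

With $x = (0, 1, 2, 3, 4)$ both calculations give 10, that is $\Sigma(x_k - \bar{x})^2$, whatever the values of $y$, because in this assumed model the log likelihood is exactly quadratic.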


6.4.4 Parameter orthogonality 


An interesting special case arises when $i_{\psi\lambda} = 0$, so that approximately $j_{\psi\lambda} = 0$.
The parameters are then said to be orthogonal. In particular, this implies that the
corresponding maximum likelihood estimates are asymptotically independent
and, by (6.71), that $D_\psi \hat\lambda_\psi = 0$ and, by symmetry, that $D_\lambda \hat\psi_\lambda = 0$. In nonorthogonal
cases if $\psi$ changes by $O(1/\sqrt{n})$, then $\hat\lambda_\psi$ changes by $O_p(1/\sqrt{n})$; for
orthogonal parameters, however, the change is $O_p(1/n)$. This property may
be compared with that of orthogonality of factors in a balanced experimental
design. There the point estimates of the main effect of one factor, being contrasts
of marginal means, are not changed by assuming, say, that the main effects of
the other factor are null. That is, the $O_p(1/n)$ term in the above discussion is
in fact zero.

There are a number of advantages to having orthogonal or nearly orthogonal 
parameters, especially component parameters of interest. Independent errors of 
estimation may ease interpretation, stability of estimates of one parameter under 
changing assumptions about another can give added security to conclusions 
and convergence of numerical algorithms may be speeded. Nevertheless, so 
far as parameters of interest are concerned, subject-matter interpretability has 
primacy. 


Example 6.4. Mixed parameterization of the exponential family. Consider a
full exponential family problem in which the canonical parameter $\phi$ and the
canonical statistic $s$ are partitioned as $(\phi_1, \phi_2)$ and $(s_1, s_2)$ respectively, thought
of as column vectors. Suppose that $\phi_2$ is replaced by $\eta_2$, the corresponding
component of the mean parameter $\eta = \nabla k(\phi)$, where $k(\phi)$ is the cumulant
generating function occurring in the standard form for the family. Then $\phi_1$ and
$\eta_2$ are orthogonal.

To prove this, we find the Jacobian matrix of the transformation from $(\phi_1, \eta_2)$
to $(\phi_1, \phi_2)$ in the form


$$ \partial(\phi_1, \eta_2)/\partial(\phi_1, \phi_2) = \begin{pmatrix} I & 0 \\ \nabla_2\nabla_1^T k(\phi) & \nabla_2\nabla_2^T k(\phi) \end{pmatrix}. \tag{6.73} $$

Here $\nabla_l$ denotes partial differentiation with respect to $\phi_l$ for $l = 1, 2$.
Combination with (6.51) proves the required result. Thus, in the analysis of
the $2 \times 2$ contingency table the difference of column means is orthogonal to
the log odds ratio; in a normal distribution mean and variance are orthogonal.


Example 6.5. Proportional hazards Weibull model. For the Weibull
distribution with density


$$ \gamma\rho(\rho y)^{\gamma - 1} \exp\{-(\rho y)^\gamma\} \tag{6.74} $$


and survivor function, or one minus the cumulative distribution function,


$$ \exp\{-(\rho y)^\gamma\}, \tag{6.75} $$


the hazard function, being the ratio of the two, is $\gamma\rho(\rho y)^{\gamma - 1}$.

Suppose that $Y_1, \ldots, Y_n$ are independent random variables with Weibull distributions
all with the same $\gamma$. Suppose that there are explanatory variables
$z_1, \ldots, z_n$ such that the hazard is proportional to $e^{\beta z}$. This is achieved by writing
the value of the parameter $\rho$ corresponding to $Y_k$ in the form $\exp\{(\alpha + \beta z_k)/\gamma\}$.
Here without loss of generality we take $\Sigma z_k = 0$. In many applications $z$ and $\beta$
would be vectors but here, for simplicity, we take one-dimensional explanatory
variables. The log likelihood is


$$ \Sigma(\log\gamma + \alpha + \beta z_k) + (\gamma - 1)\Sigma \log y_k - \Sigma \exp(\alpha + \beta z_k)\, y_k^\gamma. \tag{6.76} $$


Direct evaluation now shows that, in particular, 


$$ E\left(-\frac{\partial^2 l}{\partial\alpha\,\partial\beta}\right) = 0, \tag{6.77} $$

$$ E\left(-\frac{\partial^2 l}{\partial\beta\,\partial\gamma}\right) = -\frac{\beta\Sigma z_k^2}{\gamma}, \tag{6.78} $$

$$ E\left(-\frac{\partial^2 l}{\partial\gamma\,\partial\alpha}\right) = \frac{n(1 - 0.5772 - \alpha)}{\gamma}, \tag{6.79} $$


where Euler's constant, 0.5772, arises from the integral


$$ \int_0^\infty v \log v\, e^{-v}\, dv. \tag{6.80} $$

Now locally near $\beta = 0$ the information elements involving $\beta$ are zero or
small implying local orthogonality of $\beta$ to the other parameters and in particular
to $\gamma$. Thus not only are the errors of estimating $\beta$ almost uncorrelated with those
of the other parameters but, more importantly in some respects, the value of
$\hat\beta_\gamma$ will change only slowly with $\gamma$. In some applications this may mean that
analysis based on the exponential distribution, $\gamma = 1$, is relatively insensitive to
that assumption, at least so far as the value of the maximum likelihood estimate
of $\beta$ is concerned.


6.5 Tests and model reduction 


One of the uses of these results for vector parameters of interest is in some- 
what exploratory investigation of the form of model to be used as a basis for 
interpretation steered by a wish, for various reasons, to use reasonably simple 
models wherever feasible. We may have in addition to common nuisance parameters
that the parameter of interest $\psi$ can be partitioned, say as $(\psi_1^T, \ldots, \psi_h^T)$,
and that models with only the first $g$ components, $g < h$, may be preferred if
reasonably consistent with the data. The individual components will typically
be vectors; it is also not essential to assume, as here, that the model reductions
considered are hierarchical.

We can thus produce a series of maximized log likelihoods involving maximization
always over $\lambda$ taken with the full vector $\psi$, the vector $\psi$ with $\psi_h = 0$,
down to the vector $\psi_1$ alone, or possibly even with $\psi = 0$. If we denote these
maximized log likelihoods by $\hat{l}_h, \ldots, \hat{l}_0$ a formal test that, say, $\psi_g = 0$ given
that $\psi_{g+1} = \cdots = \psi_h = 0$ is provided by

$$ 2(\hat{l}_g - \hat{l}_{g-1}) \tag{6.81} $$


tested by comparison with the chi-squared distribution with degrees of freedom
the dimension of $\psi_g$.

This generates a sequence of values of chi-squared and in a sense the hope is 
that this sequence jumps suddenly from the highly significant to small values, 
all consistent with the relevant degrees of freedom, indicating unambiguously 
the simplest model in the sequence reasonably consistent with the data. 

If the models being considered are not nested within some overall model
the analysis is more complicated. Suppose that we have two distinct families
of models $f(y; \theta)$ and $g(y; \phi)$ which can give sufficiently similar results that
choice between them may be delicate, but which are separate in the sense that
any particular distribution in one family cannot be recovered as a special case
of the other. A simple example is where there are $n$ independent and identically
distributed components, with one family being exponential and the other log
normal, so that $\theta$ is one-dimensional and $\phi$ is two-dimensional.

There are broadly three approaches to such problems. One is to produce
a comprehensive model reducing to the two special cases when a specifying
parameter takes values, say 0 and 1. Such a family might be


$$ c(\psi, \theta, \phi) \exp\{\psi \log f(y; \theta) + (1 - \psi) \log g(y; \phi)\}, \tag{6.82} $$


used to estimate $\psi$ and very particularly to examine possible consistency with
the defining values 0 and 1.

A simpler and in many ways the most satisfactory approach is to follow the 
prescription relating confidence intervals to significance tests. That is, we take 
each model in turn as the null hypothesis and test, for example by the log likelihood
ratio $\log\{f(y; \hat\theta)/g(y; \hat\phi)\}$, consistency first with $f$ and then with $g$. This
leads to the conclusion that both, either one, or neither model is reasonably 
consistent with the data at whatever significance levels arise. The distributional 
results obtained above for the nested case no longer apply; in particular the 
log likelihood ratio test statistic has support on positive and negative real val- 
ues. While analytical results for the distribution are available, much the most 
effective procedure is simulation. 
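
A minimal sketch of such a simulation follows for the example of exponential and log normal families, taking the exponential family $f$ as the null hypothesis. The details are assumptions made for the illustration: data are simulated from the fitted exponential distribution, both families are refitted to each simulated sample, and small values of the log likelihood ratio are taken as evidence against $f$ in the direction of $g$. The function names and the number of simulations are hypothetical.

```python
import math
import random

def max_loglik_exponential(y):
    """Maximized log likelihood under the exponential family (rate n / sum(y))."""
    n = len(y)
    return -n * math.log(sum(y) / n) - n

def max_loglik_lognormal(y):
    """Maximized log likelihood under the log normal family."""
    n = len(y)
    logs = [math.log(v) for v in y]
    mu_hat = sum(logs) / n
    sigma2_hat = sum((u - mu_hat) ** 2 for u in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2.0 * math.pi * sigma2_hat) - 0.5 * n

def log_lr(y):
    """log{f(y; theta_hat) / g(y; phi_hat)} with f exponential and g log normal."""
    return max_loglik_exponential(y) - max_loglik_lognormal(y)

def simulated_p_value_exponential_null(y, n_sim=2000, seed=0):
    """Proportion of simulated statistics at least as extreme (small) as the observed one."""
    rng = random.Random(seed)
    n = len(y)
    rate_hat = n / sum(y)
    t_obs = log_lr(y)
    count = 0
    for _ in range(n_sim):
        y_star = [rng.expovariate(rate_hat) for _ in range(n)]   # sample from fitted null
        if log_lr(y_star) <= t_obs:
            count += 1
    return (count + 1) / (n_sim + 1)
```

The test of consistency with $g$ would be obtained in the same way, simulating instead from the fitted log normal distribution and treating large values of the statistic as extreme.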

A third approach is Bayesian and for this we suppose that the prior probabilities
of the two models are respectively $\pi_f$, $\pi_g = 1 - \pi_f$ and that the
conditional prior densities of the parameters given the model are respectively
$p_f(\theta)$ and $p_g(\phi)$. Then the ratio of the posterior probability of model $g$ to that
of model $f$, i.e., the odds ratio of $g$ versus $f$, is


$$ \frac{\pi_g \int \exp\{l_g(\phi)\}\, p_g(\phi)\, d\phi}{\pi_f \int \exp\{l_f(\theta)\}\, p_f(\theta)\, d\theta}, \tag{6.83} $$


where $l_g$ and $l_f$ are the relevant log likelihoods. Separate Laplace expansion of
numerator and denominator gives that the odds ratio is approximately


$$ \frac{\pi_g \exp(\hat{l}_g)\, p_g(\hat\phi)\, (2\pi)^{d_g/2} \Delta_g^{-1/2}}{\pi_f \exp(\hat{l}_f)\, p_f(\hat\theta)\, (2\pi)^{d_f/2} \Delta_f^{-1/2}}. \tag{6.84} $$


Here, for example, $d_g$ is the dimension of $\phi$ and $\Delta_g$ is the determinant of the
observed information matrix in model $g$.

A crucial aspect of this is the role of the prior densities of the parameters 
within each model. In a simpler problem with just one model a flat or nearly flat 
prior density for the parameter will, as we have seen, disappear from the answer, 
essentially because its role in the normalizing constant for the posterior density 
cancels with its role in the numerator. This does not happen here. Numerator
and denominator both have terms expressing the ratio of the maximized likelihood
to that of the approximating normal density of the maximum likelihood
estimate. Entry of the prior densities is to be expected. If one model has parameter
estimates in a region of relatively high prior density that is evidence in
favour of that model.

To proceed with a general argument, further approximations are needed.
Suppose first that the parameter spaces of the two models are of equal dimension,
$d_g = d_f$. We can then set up a (1, 1) correspondence between the two parameter
spaces, for example by making a value $\phi'$ correspond to the probability limit
of $\hat\theta$ when calculated from data distributed according to $g(y; \phi')$. Then, if the
prior density $p_f(\theta)$ is determined from $p_g(\phi)$ by the same transformation and
the determinants of the observed information related similarly, it follows that
the posterior odds are simply


$$ \frac{\pi_g \exp(\hat{l}_g)}{\pi_f \exp(\hat{l}_f)} \tag{6.85} $$

and to this level of approximation the discussion of Example 5.13 of the comparison
of completely specified models applies with maximized likelihood
replacing likelihood. The assumptions about the prior densities are by no means
necessarily true; it is possible that there is substantial prior information about $\phi$
if $g$ is true and only weak prior information about $\theta$ if $f$ is true.

When, say, $d_g > d_f$ the argument is even more tentative. Suppose that $\phi$ can
be partitioned into two parts, one of dimension $d_f$ that can be matched to $\theta$
and the other of dimension $d_g - d_f$, which is in some sense approximately
orthogonal to the first part and whose probability limit does not depend on $\theta$. The
contribution of the latter part can come only from the prior and we suppose that
it is equivalent to that supplied by $n_0$ fictitious observations, as compared with
the $n$ real observations we have for analysis. Then the ratio of the prior density to
the corresponding information contribution is approximately $(n_0/n)^{(d_g - d_f)/2}$.

This indicates the penalty to be attached to a maximized likelihood for every
additional component fitted. This is most concisely represented by replacing a
maximized log likelihood, say $\hat{l}_g$, by


$$ \hat{l}_g - (d_g/2) \log n. \tag{6.86} $$


The term in $n_0$ has been omitted as being small compared with $n$, although the
logarithmic dependence on $n$ is warning that in any case the approximation may
be poor.
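
The penalized form (6.86) is trivial to compute once the maximized log likelihoods are available. The following sketch uses a hypothetical function name and argument names chosen for the illustration.

```python
import math

def bic_penalized(max_loglik, n_params, n_obs):
    """Penalized maximized log likelihood  l_hat - (d / 2) log n  as in (6.86)."""
    return max_loglik - 0.5 * n_params * math.log(n_obs)
```

For example, with $n = 100$ observations a model with two fitted parameters is penalized by $\log 100$, about 4.6, so that it would be preferred on this criterion to a one-parameter model only if its maximized log likelihood were larger by more than about 2.3.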

This is called the Bayesian information criterion, BIC, although it was originally
suggested from a frequentist argument. A number of somewhat similar
criteria have been developed that are based on the $\hat{l}_g$ but all of which aim
to provide an automatic criterion, not based on significance-testing considerations,
for the selection of a model. They all involve penalizing the log likelihood
for the number of parameters fitted. The relevance of any relatively automatic 
method of model selection depends strongly on the objective of the analysis, for 
example as to whether it is for explanation or for empirical prediction. To dis- 
cuss the reasonableness of these and similar procedures would involve detailed 
examination of the strategy to be used in applying statistical methods. This is 
defined to be outside the scope of the present book. 

For interpretative purposes no automatic method of model selection is appro- 
priate and while the sequence of tests sketched above may often give some useful 
guidance, mechanical application is likely to be unwise. 

If the Bayesian formulation is adopted and if the posterior distribution of 
all aspects is reasonably defined a procedure of Bayesian model averaging is 
formally possible. If, for example, the objective is predicting the value of y 
corresponding to some specific values of explanatory variables, then in essence 
predictions from the separate models are weighted by the relevant posterior 
probabilities. For the method to be appropriate for interpretation, it is, however,
essential that the parameter of interest $\psi$ is defined so as to have identical
interpretations in the different models and this will often not be possible.


6.6 Comparative discussion 


There are thus three broad routes to finding asymptotic test and confid- 
ence regions, two with variants depending on whether observed or expected 
information is used and at what point the information is evaluated. In the one- 
dimensional case there is also the issue of whether equi-tailed intervals are 
appropriate and this has already been discussed in Section 3.3. In most cases 
the numerical differences between the three key asymptotic procedures are of no 
consequence, but it is essential to discuss what to do when there are appreciable 
discrepancies. 

Were the observed log likelihood exactly quadratic around $\hat\theta$, the three
procedures, with $\hat{\jmath}$ used as the basis for assessing precision, would give exactly
the same results. Procedures based directly on the maximum likelihood
estimate and a measure of precision have substantial advantages for the concise 
summarization of conclusions via an estimate and its estimated precision which, 
via the implied pivot, can be used both for tests and confidence intervals and as 
a base for further analysis, combination with other data and so on. 

They do, however, have the disadvantage, substantial under some circumstances,
of not being invariant under nonlinear transformations of $\theta$; the other
two statistics do have this exact invariance. Indeed $W_L$ from (6.56) can often
be regarded as close to the value that $W_E$ from (6.55) with $\hat{\jmath}$ would take after
reparameterization to near-quadratic form. There can be additional and related
difficulties with $W_U$ if $j(\theta)$ or $i(\theta)$ changes rapidly with $\theta$ in the relevant range.

Two further issues concern the adequacy of the normal or chi-squared form 
used to approximate the distribution of the test statistics and behaviour in very 
nonstandard situations. We return to these matters briefly in Section 6.11 and 
Chapter 7. Both considerations tend to favour the use of $W_L$, the likelihood
ratio statistic, and a simple summary of the arguments is that this is the safest 
to use in particular applications should any discrepancies appear among the 
different statistics, although in extremely nonstandard cases special analysis is, 
in principle, needed. 

For methods involving the maximum likelihood estimate and an estimate 
of its standard error the preference for staying close to a direct use of the 
likelihood, as well as more elaborate arguments of the kind summarized in 
the next section, suggest a preference for using the observed information $\hat{\jmath}$
rather than the expected information $i(\hat\theta)$. The following examples reinforce
that preference. 


Example 6.6. A right-censored normal distribution. Let $Y_1, \ldots, Y_n$ be independently
normally distributed with unknown mean $\mu$ and variance $\sigma_0^2$, assumed
for simplicity to be known. Suppose, however, that all observations greater than
a known constant $a$ are recorded simply as being greater than $a$, their individual
values not being available. The log likelihood is


$$ -\Sigma'(y_k - \mu)^2/(2\sigma_0^2) + r \log \Phi\{(\mu - a)/\sigma_0\}, \tag{6.87} $$


where $r$ is the number of censored observations and $\Sigma'$ denotes summation over
the uncensored observations.

The sufficient statistic is thus formed from the mean of the uncensored observations
and the number, $r$, censored. There is no exact ancillary statistic suitable
for conditioning. The maximum likelihood estimating equation is


$$ \hat\mu = \bar{y}' + r\sigma_0 (n - r)^{-1} \nu\{(\hat\mu - a)/\sigma_0\}, \tag{6.88} $$


where $\bar{y}'$ is the mean of the uncensored observations and $\nu(x) = \Phi'(x)/\Phi(x)$ is
the reciprocal of the Mills ratio. Moreover, the observed information is given by


$$ \hat{\jmath}\sigma_0^2 = n - r - r\nu'\{(\hat\mu - a)/\sigma_0\}. \tag{6.89} $$


An important point is that if $r = 0$, the inference is unaffected, to the first
order at least, by the possibility that censoring might have occurred but in
fact did not; the standard pivot for the mean of a normal distribution is used.
This is in line with the Bayesian solution that, except possibly for any impact
of potential censoring on the prior, the inference for $r = 0$ would also be
unaffected. Were the expected information to be used instead of the observed
information, this simple conclusion would not hold, although in most cases the
numerical discrepancy between the two pivots would be small.
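
The estimating equation (6.88) has no explicit solution but is easily solved numerically. The sketch below uses Newton iteration on the gradient of (6.87), starting from the mean of the uncensored observations, and returns the estimate together with the observed information from (6.89). The choice of Newton iteration, the function names and the tolerances are assumptions made for the illustration; the relation $\nu'(x) = -\nu(x)\{x + \nu(x)\}$ used in the code follows by direct differentiation.

```python
import math

def _phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def nu(x):
    """nu(x) = Phi'(x) / Phi(x)."""
    return _phi(x) / _Phi(x)

def nu_prime(x):
    """Derivative of nu, using nu'(x) = -nu(x) * (x + nu(x))."""
    v = nu(x)
    return -v * (x + v)

def censored_normal_mle(uncensored, r, a, sigma0, tol=1e-10, max_iter=100):
    """Return (mu_hat, j_hat) for r observations right-censored at a, known sigma0."""
    m = len(uncensored)                    # n - r uncensored observations
    ybar = sum(uncensored) / m
    mu = ybar                              # starting value
    for _ in range(max_iter):
        z = (mu - a) / sigma0
        score = m * (ybar - mu) / sigma0 ** 2 + r * nu(z) / sigma0
        j = (m - r * nu_prime(z)) / sigma0 ** 2
        step = score / j                   # Newton step
        mu += step
        if abs(step) < tol:
            break
    z = (mu - a) / sigma0
    j_hat = (m - r * nu_prime(z)) / sigma0 ** 2
    return mu, j_hat
```

When `r = 0` the function returns the sample mean and $n/\sigma_0^2$, so that the standard pivot is recovered, as in the discussion above. When `r > 0` the estimate exceeds the mean of the uncensored observations and the observed information exceeds $(n - r)/\sigma_0^2$, the censored observations contributing some but not full information.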


The following is a more striking example of the need to consider observed 
rather than expected information. 


Example 6.7. Random walk with an absorbing barrier. Consider a random
walk in which if at time $m$ the walk is at position $k$, the position at time $m + 1$
is $k + 1$ with probability $\theta$ and is $k - 1$ with probability $1 - \theta$, the steps at
different times being mutually independent. Suppose that at time zero the walk
starts at point $y_0$, where $y_0 > 0$. Suppose, further, that there is an absorbing
barrier at zero, so that if the walk ever reaches that point it stays there for ever.

The following qualitative properties of the walk are relevant. If $\theta \le 1/2$
absorption at zero is certain. If $\theta > 1/2$ there are two distinct types of path.
With probability $\{(1 - \theta)/\theta\}^{y_0}$ the walk ends with absorption at zero, whereas
with the complementary probability the walk continues indefinitely reaching
large values with high probability.

If one realization of the walk is observed for a long time period, the likelihood
is $\theta^{r_+}(1 - \theta)^{r_-}$, where $r_+$ and $r_-$ are respectively the numbers of positive and
negative steps. Use of the expected information would involve, when $\theta > 1/2$,
averaging over paths to absorption and paths that escape to large values and
would be misleading, especially if the data show speedy absorption with little
base for inferring the value of $\theta$. Note though that the event of absorption does
not provide an ancillary statistic.

A formal asymptotic argument will not be given but the most interesting case
is probably to allow observation to continue for a large number $n$ of steps, and
to suppose that $y_0 = b_0\sqrt{n}$ and that $\theta = 1/2 + \delta/\sqrt{n}$.


A very similar argument applies to the Markovian birth-death process in
continuous time, generalizing the binary fission process of Example 2.4. In this 
particles may die as well as divide and if the number of particles reaches zero 
the system is extinct. 


6.7 Profile likelihood as an information summarizer 


We have seen in the previous section some preference for profile log likelihood
as a basis for inference about the parameter of interest $\psi$ over asymptotically
equivalent bases, in particular over the maximum likelihood estimate $\hat{\psi}$ and its
estimated covariance matrix $\hat{\jmath}^{\psi\psi}$. If the profile log likelihood is essentially
symmetrical locally around $\hat{\psi}$ the approaches are nearly equivalent and, especially
if $\psi$ is one-dimensional, specification of the maximum likelihood estimate and
its standard error is more directly interpretable and hence preferable. While a
parameter transformation may often be used, in effect to correct for asymmetry
in the profile log likelihood, as already stressed physical interpretability of a
parameter must take precedence over the statistical properties of the resulting
estimates and so parameter transformation may not always be appropriate.

An important role for statistical analysis is not only the clarification of what 
can reasonably be learned from the data under analysis, but also the summariz- 
ation of conclusions in a form in which later work can use the outcome of the 
analysis, for example in combination with subsequent related results. Recourse 
to the original data may sometimes be impracticable and, especially in
complex problems, only information about the parameter $\psi$ of interest may be worth
retaining.

Where the profile log likelihood $l_P(\psi)$ is appreciably asymmetric around the
maximum, or in other cases of notable departure from local quadratic form, it
may be sensible to record either the profile log likelihood in numerical form or
to record say the third and even fourth derivatives at the maximum, as well as
the second, the information element. When independent sets of data potentially
relevant to $\psi$ become available the overall profile log likelihood can then be
reconstructed as the sum of the separate contributions. The mutual consistency of
the various estimates of $\psi$ can be examined via a chi-squared test and, subject
to reasonable consistency, a pooled estimate, the overall maximum likelihood
estimate, found. This analysis implicitly assumes that nuisance parameters in
different sets of data are distinct or, if that is not the case, that the information
lost by ignoring common elements is unimportant.
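
As an illustration of such pooling, the sketch below sums profile log likelihoods recorded on a common grid and forms a likelihood ratio statistic for mutual consistency. The choice of a normal mean with a separate unknown variance in each set of data, and the function names, are assumptions made for the example only.

```python
import math


def profile_loglik_mean(data, mu):
    """Profile log likelihood for a normal mean, the variance maximized out
    (additive constants dropped)."""
    n = len(data)
    return -0.5 * n * math.log(sum((y - mu) ** 2 for y in data) / n)


def pooled_analysis(datasets, grid):
    """Sum the separate profile log likelihoods over a common grid.

    Returns the pooled estimate (the grid maximizer) and the statistic W,
    twice the drop from the sum of the separate maxima to the pooled maximum.
    """
    total = [sum(profile_loglik_mean(d, mu) for d in datasets) for mu in grid]
    best = max(range(len(grid)), key=total.__getitem__)
    separate_max = sum(profile_loglik_mean(d, sum(d) / len(d)) for d in datasets)
    w = 2.0 * (separate_max - total[best])
    return grid[best], w
```

For a scalar parameter of interest and $K$ sets of data, W would be referred approximately to the chi-squared distribution with $K - 1$ degrees of freedom; because the pooled maximum is found only on a grid, W is very slightly overstated.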


6.8 Constrained estimation 


In general, the more parameters, the more complicated the estimation problem. 
There are exceptions, however, notably when a model is expanded to a saturated 
form or, more generally, where the (k,d) curved exponential family model is 
expanded to the full (k, k) family. We may sometimes usefully estimate within 
the larger family and then introduce the specific constraints that apply to the 
specific problem. This idea can, in particular, be used to find simple estimates 
asymptotically equivalent to maximum likelihood estimates. We shall relate the 
method more specifically to the exponential family.

Suppose that the model of interest is defined by a parameter vector $\theta$ and
that the extended or covering model for which estimation is simpler is defined
by $\phi = (\theta, \gamma)$. If $i(\phi)$, partitioned in the usual way, denotes the expected
information matrix and $U_\phi^T = (U_\theta^T, U_\gamma^T)$ is the gradient of the log likelihood, the
maximum likelihood estimates, $\hat{\theta}_r$, under the reduced, i.e., originating, model
and the estimates, $\hat{\phi}_c = (\hat{\theta}_c, \hat{\gamma}_c)$, under the covering model satisfy after local linearization

    $i_{\theta\theta}(\hat{\theta}_r - \theta) = U_\theta$, (6.90)
    $i_{\theta\theta}(\hat{\theta}_c - \theta) + i_{\theta\gamma}(\hat{\gamma}_c - \gamma) = U_\theta$. (6.91)

Suppose that $\gamma$ is defined so that the model of interest corresponds to
$\gamma = 0$. Then

    $\tilde{\theta}_r = \hat{\theta}_c + i_{\theta\theta}^{-1} i_{\theta\gamma} \hat{\gamma}_c$ (6.92)

is close to $\hat{\theta}_r$, the difference being $O_p(1/n)$. The expected information may be
replaced by the observed information. The estimate $\tilde{\theta}_r$ is essentially the estimate
under the covering model adjusted for regression on $\hat{\gamma}_c$, which is an estimate of
zero. If the regression adjustment were appreciable this would be an indication
that the reduced model is inconsistent with the data.

Realistic applications of this idea are to relatively complex contingency table 
problems and to estimation of normal-theory covariance matrices subject to 
independence constraints. In the latter case the covering model specifies an 
unconstrained covariance matrix for which estimation is simple. The following 
is an artificial example which shows how the method works. 


Example 6.8. Curved exponential family model. After reduction by sufficiency,
suppose that $(Y_1, Y_2)$ are independently normally distributed with mean $(\theta, b\theta^2)$
and variances $\sigma_0^2/n$, where $\sigma_0^2$ is known. Here $b$ is a known constant. This
forms a (2, 1) exponential family and there is no ancillary statistic to reduce the
problem further. Consider the covering model in which the mean is $(\theta, b\theta^2 + \gamma)$.
Then the estimates under the covering model are $\hat{\theta}_c = Y_1$, $\hat{\gamma}_c = Y_2 - bY_1^2$. The
information matrix at $\gamma = 0$ has elements

    $i_{\theta\theta} = (1 + 4b^2\theta^2)n/\sigma_0^2$, $i_{\theta\gamma} = 2b\theta n/\sigma_0^2$, $i_{\gamma\gamma} = n/\sigma_0^2$. (6.93)

Thus, on replacing $\theta$ in the matrix elements by the estimate $Y_1$, we have that

    $\tilde{\theta}_r = Y_1 + \frac{2bY_1}{1 + 4b^2Y_1^2}(Y_2 - bY_1^2)$. (6.94)

That is, the adjusted estimate differs from the simple estimate $Y_1$ by an
amount depending on how far the observations are from the curve defining
the exponential family.

Direct calculation with the information matrices shows that, in contrast with
$\mathrm{var}(Y_1) = \sigma_0^2/n$, the new estimate has asymptotic variance

    $(\sigma_0^2/n)(1 + 4b^2\theta^2)^{-1}$. (6.95)
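
The closeness of the noniterative estimate to the exact maximum likelihood estimate can be checked numerically. In the sketch below, which is an illustration rather than part of the example, the exact estimate is found by Newton iteration on the sum of squared distances to the curve; this is the maximum likelihood criterion here because the two variances are equal and known. The function names and the starting value are assumptions.

```python
def adjusted_estimate(y1, y2, b):
    """Noniterative estimate: covering-model estimate y1 adjusted for
    regression on gamma_hat = y2 - b*y1**2."""
    gamma_hat = y2 - b * y1 ** 2
    return y1 + 2.0 * b * y1 / (1.0 + 4.0 * b ** 2 * y1 ** 2) * gamma_hat


def constrained_mle(y1, y2, b, start, n_iter=50):
    """Minimize (y1 - theta)**2 + (y2 - b*theta**2)**2 by Newton iteration."""
    theta = start
    for _ in range(n_iter):
        resid2 = y2 - b * theta ** 2
        grad = -2.0 * (y1 - theta) - 4.0 * b * theta * resid2
        hess = 2.0 - 4.0 * b * resid2 + 8.0 * b ** 2 * theta ** 2
        theta -= grad / hess
    return theta


# Observations close to the curve (theta, theta**2) near theta = 1.
theta_tilde = adjusted_estimate(1.02, 1.01, 1.0)
theta_hat = constrained_mle(1.02, 1.01, 1.0, start=1.02)
```

With these illustrative values the two estimates agree far more closely with each other than either does with the simple estimate $Y_1$.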


We now take a different route to closely related results. Consider the 
exponential family model with log likelihood 


    $s^T\phi - k(\phi)$. (6.96)


Recall that the mean parameter is $\eta = \nabla k(\phi)$, and that the information matrix
$i(\phi)$ is the Jacobian of the transformation from $\phi$ to $\eta$.

Suppose that the model is constrained by requiring some components of the 
canonical parameter to be constant, often zero, i.e., 


    $\phi_c = 0 \quad (c \in C)$. (6.97)

We therefore consider a Lagrangian with Lagrange multipliers $\nu_c$, namely

    $s^T\phi - k(\phi) - \sum_{c \in C} \nu_c\phi_c$. (6.98)

If we differentiate with respect to $\phi_m$ for $m$ not in $C$ we have $s_m = \hat{\eta}_m$,
whereas for $m \in C$, we have that

    $s_m - \hat{\eta}_m - \nu_m = 0$. (6.99)


The Lagrange multipliers $\nu_m$ are determined by solving the simultaneous
equations obtained by expressing $\phi$ as a function of $\eta$ as

    $\phi_C(s_U, s_C - \nu) = 0$, (6.100)

where $s_U$ are the components of the canonical statistic not in the conditioning
set C. That is, the estimates of the mean parameters not involved in the con- 
straints are unaffected by the constraints, whereas the estimates of the other 
mean parameters are defined by enforcing the constraints. 

It can be shown that if standard $n$ asymptotics holds, then $\nu$ is $O(1/\sqrt{n})$ and
expansion of (6.100) yields essentially the noniterative procedure given at the 
beginning of this section. 

A similar analysis, different in detail, applies if the constraints pertain to the 
mean parameter $\eta$ rather than to the canonical parameter $\phi$. In this case the
Lagrangian is

    $s^T\phi - k(\phi) - \omega^T\eta_C$, (6.101)

with Lagrange multipliers $\omega$.

Differentiate with respect to $\phi_m$ to give maximum likelihood estimating
equations

    $s_m - \hat{\eta}_m - \omega_k\,\partial\hat{\eta}_k/\partial\phi_m = 0$, (6.102)

i.e.,

    $s_m - \hat{\eta}_m - \hat{\imath}_{km}\omega_k = 0$. (6.103)

For $m \in C$ we require $\hat{\eta}_m = 0$, so that

    $s_C = \hat{\imath}_{CC}\,\omega$, (6.104)

where $\hat{\imath}_{CC}$ is the part of the information matrix referring to the variables in $C$
and evaluated at the maximum likelihood point. Similarly, for the variables
$U$ not in $C$,

    $\hat{\eta}_U = s_U - \hat{\imath}_{UC}\,\hat{\imath}_{CC}^{-1}s_C$. (6.105)


If we have $n$ replicate observations and standard $n$ asymptotics apply we may
again replace $\hat{\imath}$ by simpler estimates with errors $O_p(1/\sqrt{n})$ to obtain
noniterative estimates of $\eta_U$ that are asymptotically equivalent to maximum likelihood
estimates.


Example 6.9. Covariance selection model. A special case, arising in particular
in the discussion of Markovian undirected graphical systems, is based on
independent and identically distributed $p \times 1$ random vectors normally distributed
with covariance matrix $\Sigma$. There is no essential loss of generality in supposing
that the mean is zero. The log likelihood for a single observation is

    $(\log\det\Sigma^{-1} - y^T\Sigma^{-1}y)/2$ (6.106)

and the second term can be written

    $-\mathrm{tr}\{\Sigma^{-1}(yy^T)\}/2$. (6.107)

Here the trace, tr, of a square matrix is the sum of its diagonal elements.
It follows that for the full data the log likelihood is

    $\{n\log\det\Sigma^{-1} - \mathrm{tr}(\Sigma^{-1}S)\}/2$, (6.108)

where $S$ is the matrix of observed sums of squares and products. The
concentration matrix $\Sigma^{-1}$ is the canonical parameter, the canonical statistic is $S$ and the
mean parameter $n\Sigma$. It follows that in the unconstrained model the maximum
likelihood estimate of $\Sigma$, obtained by equating the canonical statistic to its
expectation, is the sample covariance matrix $S/n$.

Now consider the constrained model in which the elements of $\Sigma^{-1}$ in the set $C$
are zero. The statistical interpretation is that the pairs of components in $C$ are
conditionally independent given the remaining components. It follows from the
general discussion that the maximum likelihood estimates of the covariances of 
pairs of variables not in C are unchanged, i.e., are the sample covariances. The 
remaining estimates are defined by forcing the implied concentration elements 
to zero as required by the constraints. Approximations can be obtained by 
replacing the information matrices by large-sample approximations. 
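
In the simplest nontrivial case, three variables with the single constraint that the (1, 3) concentration element is zero, the estimate is available in closed form: all sample variances and covariances are retained except for the (1, 3) covariance, which is replaced by the value implied by conditional independence of variables 1 and 3 given variable 2. The sketch below illustrates this special case only; the function name and the numerical matrix are assumptions, and constraint sets without such a closed form call for iteration.

```python
import numpy as np


def fit_zero_concentration_13(sample_cov):
    """Covariance estimate for three variables when the (1, 3) element of the
    concentration matrix is constrained to zero.

    Every entry of the sample covariance matrix is kept except the (1, 3)
    covariance, which is set to s12 * s23 / s22.
    """
    sigma = np.array(sample_cov, dtype=float)
    sigma[0, 2] = sigma[2, 0] = sigma[0, 1] * sigma[1, 2] / sigma[1, 1]
    return sigma


# Illustrative sample covariance matrix S/n.
sample_cov = [[2.0, 0.8, 0.9],
              [0.8, 1.5, 0.5],
              [0.9, 0.5, 1.0]]
fitted = fit_zero_concentration_13(sample_cov)
concentration = np.linalg.inv(fitted)  # (1, 3) element is zero up to rounding
```

Inverting the fitted matrix confirms that the constrained concentration element vanishes while the unconstrained covariances are the sample values.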


6.9 Semi-asymptotic arguments 


In some applications relatively simple exact inference would be available about
the parameter of interest $\psi$ were the nuisance parameter $\lambda$ known. If errors of
estimating $\lambda$ are relatively small the following argument may be used.

Suppose that to test the hypothesis $\psi = \psi_0$ for known $\lambda$ the p-value
corresponding to data $y$ is a function of part of the data $y_\psi$ and is $p(y_\psi, \psi_0; \lambda)$.
Suppose that conditionally on $y_\psi$ the estimate $\hat{\lambda}$ of $\lambda$ has expectation $\lambda + o(1/n)$
and covariance matrix $v(\lambda)/n$. Now taking expectations over the distribution of
$\hat{\lambda}$ given $y_\psi$, we have, after Taylor expansion to two terms, that

    $E\{p(y_\psi, \psi_0; \hat{\lambda})\} = p(y_\psi, \psi_0; \lambda) + \frac{1}{2n}\mathrm{tr}\{v(\lambda)\nabla_\lambda\nabla_\lambda^T p(y_\psi, \psi_0; \lambda)\}$. (6.109)


Now an estimated probability with expectation equal to a true value will 
itself, subject to lying in (0, 1), have a probability interpretation. This leads to 
consideration of 


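    $p(y_\psi, \psi_0; \hat{\lambda}) - (2n)^{-1}\mathrm{tr}\{v(\hat{\lambda})\nabla_\lambda\nabla_\lambda^T p(y_\psi, \psi_0; \hat{\lambda})\}$. (6.110)


This is most safely expressed in the asymptotically equivalent form
$p(y_\psi, \psi_0; \hat{\lambda} + x)$, where $x$ is any solution of

    $x^T\nabla_\lambda p = -(2n)^{-1}\mathrm{tr}(v\nabla_\lambda\nabla_\lambda^T p)$. (6.111)

Thus if $\lambda$ is a scalar we take

    $p\left\{y_\psi, \psi_0; \hat{\lambda} - \frac{v(\hat{\lambda})\,\partial^2 p/\partial\lambda^2}{2n\,\partial p/\partial\lambda}\right\}$. (6.112)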


Example 6.10. Poisson-distributed signal with estimated background.
Suppose that in generalization of Example 3.7 we observe two random variables
$Y$ and $Y_B$ having independent Poisson distributions with means respectively
$(\psi + \lambda)t_S$ and $\lambda t_B$. Thus $Y$ refers to a signal plus background observed for a
time $t_S$ and $Y_B$ to the background observed for a time $t_B$. We consider a
situation in which $t_B$ is long compared with $t_S$, so that the background is relatively
precisely estimated. If $\lambda$ were known the p-value for testing $\psi = \psi_0$ sensitive
against larger values of $\psi$ would be


    $p(y, \psi_0; \lambda) = \sum_{v=0}^{y} \exp\{-(\psi_0 + \lambda)t_S\}\{(\psi_0 + \lambda)t_S\}^v/v!$. (6.113)


We write $\hat{\lambda} = t_B^{-1}y_B$ and to a sufficient approximation $v(\lambda)/n = y_B/t_B^2$. In the
special case when $y = 0$ the adjusted p-value becomes


    $\exp[-\psi_0 t_S - y_B t_S t_B^{-1}\{1 + t_S/(2t_B)\}]$. (6.114)
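
The adjustment is easily computed for general $y$. The following sketch assumes that the p-value is the Poisson tail sum from 0 to $y$, the form that reduces to the exponential expression above when $y = 0$. For that sum the ratio of the second to the first derivative with respect to $\lambda$ is $-t_S(1 - y/\mu)$, with $\mu = (\psi_0 + \lambda)t_S$, so that the shift applied to $\hat{\lambda}$ is explicit. The function names are assumptions.

```python
import math


def poisson_lower_tail(y, mean):
    """P(Y <= y) for a Poisson variable with the given mean."""
    term, total = math.exp(-mean), 0.0
    for v in range(y + 1):
        total += term
        term *= mean / (v + 1)
    return total


def adjusted_p_value(y, y_b, t_s, t_b, psi0):
    """Tail probability evaluated at a shifted background estimate.

    lam_hat = y_b/t_b is the background estimate and var_lam = y_b/t_b**2
    plays the role of v(lambda)/n. The shift is
    -var_lam * (d2p/dlam2) / (2 * dp/dlam) = 0.5 * var_lam * t_s * (1 - y/mean).
    """
    lam_hat = y_b / t_b
    var_lam = y_b / t_b ** 2
    mean = (psi0 + lam_hat) * t_s
    shift = 0.5 * var_lam * t_s * (1.0 - y / mean)
    return poisson_lower_tail(y, (psi0 + lam_hat + shift) * t_s)
```

For $y = 0$ the shift is $y_B t_S/(2t_B^2)$ and the function reproduces the closed form above.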


6.10 Numerical-analytic aspects 


6.10.1 General remarks 


In the majority of applications the maximum of the likelihood $l(\theta)$ and
associated statistics have to be found numerically or in the case of Bayesian
calculations broadly comparable numerical integrations have to be done. In 
both frequentist and Bayesian approaches it may be fruitful to explore the gen- 
eral character of the likelihood function and, in particular, to compute and plot 
the profile log likelihood in the region of the maximum. The details of these 
procedures raise numerical-analytic rather than statistical issues and will not 
be discussed in depth here. We deal first with deterministic numerical methods 
and then with simulation-based approaches. 


6.10.2 Numerical maximization 


One group of techniques for numerical maximization of a function such as a 
log likelihood can be classified according as specification in readily computable 
form is possible for 


e only function values, 
e function values and first derivatives, 
e function values and first and second derivatives. 


In some situations an initial value of $\theta$ is available that is virtually sure to be close
to the required maximum. In others it will be wise to start iterative algorithms 
from several or indeed many different starting points and perhaps also to use 
first a very coarse grid search in order to assess the general shape of the surface. 

The most direct method available for low-dimensional problems is grid
search, i.e., computation of $l(\theta)$ over a suitably chosen grid of points. It can
be applied more broadly to the profile likelihood when $\psi$ is, say, one- or
two-dimensional, combined with some other method for computing $\hat{\lambda}_\psi$. Grid
search, combined with a plot of the resulting function, allows checking for
likelihoods of nonstandard shapes, for example those departing appreciably from
the quadratic form that is the basis of asymptotic theory.

In particular, a simple but effective way of computing the profile likelihood
for one- or even two-dimensional parameters of interest, even in relatively
complicated problems, is just to evaluate $l(\psi, \lambda)$ for a comprehensive grid of values
of the argument; this in some contexts is feasible provided a reasonably efficient
algorithm is available for computing $l$. The envelope of a plot of $l$ against $\psi$
shows the profile log likelihood. This is especially useful if $\psi$ is a relatively
complicated function of the parameters in terms of which the likelihood is most
conveniently specified.
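
A minimal sketch of the envelope idea follows. It takes, purely as an assumed example, a normal sample with the likelihood specified in terms of the mean and standard deviation and with parameter of interest their ratio; the log likelihood is evaluated over a grid and the largest value falling in each narrow bin of the parameter of interest is retained. The function names and the binning rule are assumptions.

```python
import math


def normal_loglik(data, mu, sigma):
    """Normal log likelihood with additive constants dropped."""
    n = len(data)
    return -n * math.log(sigma) - sum((y - mu) ** 2 for y in data) / (2.0 * sigma ** 2)


def envelope_profile(data, mu_grid, sigma_grid, bin_width):
    """Upper envelope of the log likelihood against psi = mu/sigma.

    Returns a dict mapping the centre of each psi bin to the largest
    log likelihood found among grid points falling in that bin.
    """
    envelope = {}
    for mu in mu_grid:
        for sigma in sigma_grid:
            psi_bin = round(mu / sigma / bin_width)
            value = normal_loglik(data, mu, sigma)
            if value > envelope.get(psi_bin, -math.inf):
                envelope[psi_bin] = value
    return {k * bin_width: v for k, v in sorted(envelope.items())}
```

The accuracy of the envelope depends on the grid being fine enough that each bin contains points close to the constrained maximum; a plot of the returned values against the bin centres gives the approximate profile log likelihood.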

Next there is a considerable variety of methods of function maximization 
studied in the numerical-analytic literature, some essentially exploiting local 
quadratic approximations applied iteratively. They are distinguished by whether 
none, one or both of the first and second derivatives are specified analytically. 
All such procedures require a convergence criterion to assess when a max- 
imum has in effect been reached. Especially when there is the possibility of a 
somewhat anomalous log likelihood surface, caution is needed in judging that 
convergence is achieved; reaching the same point from very different starting 
points is probably the best reassurance. 

The most widely used method of the first type, i.e., using only function
evaluations, is the Nelder-Mead simplex algorithm. In this the log likelihood is
evaluated at the vertices of a simplex and further points added to the simplex 
in the light of the outcome, for example reflecting the point with lowest log 
likelihood in the complementary plane. There are numerous possibilities for 
changing the relative and absolute sizes of the edges of the simplex. When the 
algorithm is judged to have converged subsidiary calculations will be needed 
to assess the precision of the resulting estimates. 

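Many of the procedures are some variant of the Newton-Raphson iteration
in which the maximum likelihood estimating equation is linearized around $\theta_0$
to give

    $\nabla l(\theta_0) + \nabla\nabla^T l(\theta_0)(\hat{\theta} - \theta_0) = 0$, (6.115)

leading to the iterative scheme

    $\hat{\theta}_{t+1} - \hat{\theta}_t = -\{\nabla\nabla^T l(\hat{\theta}_t)\}^{-1}\nabla l(\hat{\theta}_t)$. (6.116)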


This requires evaluation of first and second derivatives. In many problems of 
even modest complexity the direct calculation of second derivatives can be 
time-consuming or impracticable so that indirect evaluation via a set of first 
derivatives is to be preferred. Such methods are called quasi-Newton methods.

These and related methods work well if the surface is reasonably close
to quadratic in a region including both the starting point and the eventual 
maximum. In general, however, there is no guarantee of convergence. 
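
For a scalar parameter the Newton-Raphson scheme can be sketched as follows. The Cauchy location model is used, as an assumed example, because its log likelihood can be far from quadratic, so that the choice of starting point matters; the function names, tolerance and iteration limit are illustrative choices.

```python
def cauchy_score(data, theta):
    """First derivative of the Cauchy location log likelihood."""
    return sum(2.0 * (y - theta) / (1.0 + (y - theta) ** 2) for y in data)


def cauchy_hessian(data, theta):
    """Second derivative of the Cauchy location log likelihood."""
    return sum(2.0 * ((y - theta) ** 2 - 1.0) / (1.0 + (y - theta) ** 2) ** 2
               for y in data)


def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=100):
    """Iterate theta <- theta - score/hessian until the step is below tol."""
    theta = theta0
    for _ in range(max_iter):
        step = -score(theta) / hessian(theta)
        theta += step
        if abs(step) < tol:
            return theta
    raise RuntimeError("Newton-Raphson did not converge")
```

Starting from the sample median usually succeeds here; starting far out in the tails, where the second derivative is positive, the same iteration can move away from the maximum, illustrating the absence of any general guarantee.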

An important technique for finding maximum likelihood estimates either 
when there is appreciable missing information which distorts a simple estima- 
tion scheme or where the likelihood is specified indirectly via latent variables 
involves what is sometimes called a self-consistency or missing information 
principle and is encapsulated in what is called the EM algorithm. The gen- 
eral idea is to iterate between formally estimating the missing information and 
estimating the parameters. 

It can be shown that each step of the standard version of the EM algorithm can 
never decrease the log likelihood of Y. Usually, but not necessarily, this implies 
convergence of the algorithm, although this may be to a subsidiary rather than a 
global maximum and, in contrived cases, to a saddle-point of the log likelihood. 

There are numerous elaborations, for example to speed convergence which 
may be slow, or to compute estimates of precision; the latter are not obtainable 
directly from the original EM algorithm. 
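
The monotonicity property is easily seen in a simple case. The sketch below applies the EM iteration to a two-component normal mixture with unit variances, the unobserved component labels being the missing information; the model, the function names and the fixed number of iterations are assumptions made for illustration.

```python
import math


def normal_pdf(y, mean):
    """Density of the normal distribution with the given mean and unit variance."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mixture_loglik(data, mix, mu1, mu2):
    """Log likelihood of the observed data under the mixture."""
    return sum(math.log(mix * normal_pdf(y, mu1) + (1.0 - mix) * normal_pdf(y, mu2))
               for y in data)


def em_mixture(data, mix, mu1, mu2, n_iter=50):
    """Return final (mix, mu1, mu2) and the log likelihood after each iteration."""
    history = [mixture_loglik(data, mix, mu1, mu2)]
    for _ in range(n_iter):
        # E-step: probability that each observation comes from component 1.
        w = [mix * normal_pdf(y, mu1)
             / (mix * normal_pdf(y, mu1) + (1.0 - mix) * normal_pdf(y, mu2))
             for y in data]
        # M-step: weighted proportion and weighted means.
        total = sum(w)
        mix = total / len(data)
        mu1 = sum(wi * y for wi, y in zip(w, data)) / total
        mu2 = sum((1.0 - wi) * y for wi, y in zip(w, data)) / (len(data) - total)
        history.append(mixture_loglik(data, mix, mu1, mu2))
    return mix, mu1, mu2, history
```

The recorded log likelihoods never decrease from one iteration to the next; different starting values should nevertheless be tried, because the limit may be a subsidiary maximum.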

In all iterative methods specification of when to stop is important. A long 
succession of individually small but steady increases in log likelihood is a 
danger sign! 


6.10.3 Numerical integration 


The corresponding numerical calculations in Bayesian discussions are ones 
of numerical integration arising either in the computation of the normaliz- 
ing constant after multiplying a likelihood function by a prior density or by 
marginalizing a posterior density to isolate the parameter of interest. 

The main direct methods of integration are Laplace expansion, usually taking 
further terms in the standard development of the expansion, and adaptive quad- 
rature. In the latter standard procedures of numerical integration are used that 
are highly efficient when the function evaluations are at appropriate points. The 
adaptive part of the procedure is concerned with steering the choice of points 
sensibly, especially when many evaluations are needed at different parameter 
configurations. 
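
The leading term of the Laplace expansion can be sketched as follows. The integrand is approximated by a normal curve matched at its mode, so that the logarithm of the integral is approximately the log integrand at the mode plus half the logarithm of $2\pi$ divided by minus the second derivative there. The function name is an assumption, and the example integrand $\theta^a e^{-\theta}$ is chosen only because its integral, $\Gamma(a + 1)$, is known exactly.

```python
import math


def laplace_log_integral(log_f, second_derivative, mode):
    """Leading-term Laplace approximation to the log of the integral of f.

    log_f is the log integrand, second_derivative is the second derivative
    of log_f, and mode is the point at which log_f is maximized.
    """
    return log_f(mode) + 0.5 * math.log(2.0 * math.pi / -second_derivative(mode))


a = 20.0
approx = laplace_log_integral(lambda t: a * math.log(t) - t,  # log integrand
                              lambda t: -a / t ** 2,          # its second derivative
                              a)                              # mode at t = a
exact = math.lgamma(a + 1.0)
```

Here the discrepancy between the exact and approximate values is close to $1/(12a)$, the next term of the expansion, which indicates the gain to be had from taking further terms.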


6.10.4 Simulation-based techniques 


The above methods are all deterministic in that once a starting point and 
a convergence criterion have been set the answer is determined, except for 
what are normally unimportant rounding errors. Such methods can be contras- 
ted with simulation-based methods. One of the simplest of these in Bayesian 
discussions is importance sampling in which a numerical integration is inter- 
preted as an expectation, which in turn is estimated by repeated sampling. The 
most developed type of simulation procedure is the Markov chain Monte Carlo, 
MCMC, method. In most versions of this the target is the posterior distribution 
of a component parameter. This is manoeuvred to be the equilibrium distribu- 
tion of a Markov chain. A large number of realizations of the chain are then 
constructed and the equilibrium distribution obtained from the overall distribu- 
tion of the state values, typically with an initial section deleted to remove any 
transient effect of initial conditions. 

Although in most cases it will be clear that the chain has reached a stationary 
state there is the possibility for the chain to be stuck in a sequence of transitions 
far from equilibrium; this is broadly parallel to the possibility that iterative 
maximization procedures which appear to have converged in fact have not 
done so. 
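
A minimal random-walk Metropolis sketch of the MCMC idea follows. The target, chosen as an assumed example because the answer is known, is the posterior density of a Poisson mean under a gamma prior; the proposal scale, burn-in length and function names are illustrative choices rather than recommendations.

```python
import math
import random


def log_posterior(theta, data, shape=1.0, rate=1.0):
    """Log posterior density, up to a constant, for a Poisson mean with a
    gamma(shape, rate) prior."""
    if theta <= 0:
        return -math.inf
    return (shape - 1.0 + sum(data)) * math.log(theta) - (rate + len(data)) * theta


def metropolis(data, n_draws, burn_in, step, rng, start=1.0):
    """Random-walk Metropolis chain; the first burn_in states are discarded."""
    theta, current = start, log_posterior(start, data)
    draws = []
    for i in range(n_draws + burn_in):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, data)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            draws.append(theta)
    return draws
```

With counts 3, 5, 4, 6, 2 and the default prior the exact posterior is a gamma distribution with shape 21 and rate 6, so the mean of the retained states can be compared with 3.5; in realistic problems no such check is available, and running several chains from dispersed starting points is one precaution against the failure described above.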

One aspect of frequentist discussions concerns the distribution of test statist- 
ics, in particular as a basis for finding p-values or confidence limits. The simplest 
route is to find the maximum likelihood estimate, $\hat{\theta}$, of the full parameter vector
and then to sample repeatedly the model $f_Y(y; \hat{\theta})$ and hence find the distribution
of concern. For testing a null hypothesis only the nuisance parameter need be
estimated. It is only when the conclusions are in some sense borderline that this
need be done with any precision. This approach ignores errors in estimating $\theta$.
In principle the sampling should be conditional on the observed information $\hat{\jmath}$,
the approximately ancillary statistic for the model. An ingenious and powerful 
alternative available at least in simpler problems is to take the data themselves as 
forming the population for sampling randomly with replacement. This proced- 
ure, called the (nonparametric) bootstrap, has improved performance especially 
if the statistic for study can be taken in pivotal form. The method is related to the 
notion of empirical likelihood, based on an implicit multinomial distribution 
with support the data points. 
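
The use of a pivotal form can be sketched for the simplest case, limits for a mean. The data are resampled with replacement, the studentized quantity is recomputed for each resample and its empirical quantiles replace those of the normal distribution. The function names and the crude quantile rule are assumptions made for illustration.

```python
import math
import random


def mean_and_se(sample):
    """Sample mean and its estimated standard error."""
    n = len(sample)
    m = sum(sample) / n
    var = sum((y - m) ** 2 for y in sample) / (n - 1)
    return m, math.sqrt(var / n)


def bootstrap_t_interval(data, n_boot, level, rng):
    """Limits for the mean from the bootstrap distribution of the
    studentized pivot (m_star - m) / se_star."""
    m, se = mean_and_se(data)
    pivots = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        m_star, se_star = mean_and_se(resample)
        if se_star > 0:  # skip degenerate resamples with no variation
            pivots.append((m_star - m) / se_star)
    pivots.sort()
    q_low = pivots[int((1.0 - level) / 2.0 * len(pivots))]
    q_high = pivots[int((1.0 + level) / 2.0 * len(pivots)) - 1]
    # The upper quantile of the pivot gives the lower limit and vice versa.
    return m - q_high * se, m - q_low * se
```

For skewed data the resulting limits are asymmetric about the sample mean, reflecting the asymmetry of the distribution of the pivot.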


6.11 Higher-order asymptotics 


A natural development of the earlier discussion is to develop the asymptotic 
theory to a further stage. Roughly this involves taking series expansions to fur- 
ther terms and thus finding more refined forms of test statistic and establishing 
distributional approximations better than the normal or chi-squared
distributions which underlie the results summarized above. In Bayesian estimation
theory it involves, in particular, using further terms of the Laplace expansion
generating approximate normality, in order to obtain more refined approxim- 
ations to the posterior distribution; alternatively asymptotic techniques more 
delicate than Laplace approximation may be used and may give substantially 
improved results. 

There are two objectives to such discussions in frequentist theory. The more 
important aims to give some basis for choosing between the large number of pro- 
cedures which are equivalent to the first order of asymptotic theory. The second 
is to provide improved distributional approximations, in particular to provide 
confidence intervals that have the desired coverage probability to greater accur- 
acy than the results sketched above. In Bayesian theory the second objective is 
involved. 

Typically simple distributional results, such as that the pivot $(\hat{\theta} - \theta)\sqrt{n\bar{\imath}}$ has
asymptotically a standard normal distribution, mean that probability statements
about the pivot derived from the standard normal distribution are in error by
$O(1/\sqrt{n})$ as $n$ increases. One object of asymptotic theory is thus to modify the
statement so that the error is O(1/n) or better. Recall though that the limiting 
operations are notional and that in any case the objective is to get adequate 
numerical approximations and the hope would be that with small amounts of 
data the higher-order results would give better approximations. 

The resulting theory is quite complicated and all that will be attempted here 
is a summary of some conclusions, a number of which underlie some of the 
recommendations made above. 

First it may be possible sometimes to find a Bayesian solution for the posterior 
distribution in which posterior limits have a confidence limit interpretation 
to a higher order. Indeed to the first order of asymptotic theory the posterior 
distribution of the parameter is normal with variance the inverse of the observed 
information for a wide class of prior distributions, thus producing agreement 
with the confidence interval solution. 

With a single unknown parameter it is possible to find a prior, called a 
matching prior, inducing closer agreement. Note, however, that from conven- 
tional Bayesian perspectives this is not an intrinsically sensible objective. The 
prior serves to import additional information and enforced agreement with 
a different approach is then irrelevant. While a formal argument will not be 
given the conclusion is understandable from a combination of the results in 
Examples 4.10 and 6.3. The location model of Example 4.10 has as its fre- 
quentist analysis direct consideration of the likelihood function equivalent to 
a Bayesian solution with a flat (improper) prior. Example 6.3 shows the
transformation of $\theta$ that induces constant information, one way of approximating
location form. This suggests that a flat prior on the transformed scale, i.e., a prior
density proportional to

    $\sqrt{i(\theta)}$, (6.117)


is suitable for producing agreement of posterior interval with confid- 
ence interval to a higher order, the prior thus being second-order match- 
ing. Unfortunately for multidimensional parameters such priors, in general, 
do not exist. 

The simplest general result in higher-order asymptotic theory concerns the
profile likelihood statistic $W_L$. Asymptotically this has a chi-squared distribution
with $d_\psi$ degrees of freedom. Were the asymptotic theory to hold exactly we
would have

    $E(W_L) = d_\psi$. (6.118)

Typically

    $E(W_L) = d_\psi(1 + B/n)$, (6.119)

where, in general, $B$ is a constant depending on unknown parameters.
This suggests that if we modify $W_L$ to

    $W_L' = W_L/(1 + B/n)$ (6.120)


the mean of the distribution will be closer to that of the approximating chi- 
squared distribution. Remarkably, not only is the mean improved but the whole 
distribution function of the test statistic becomes closer to the approximating 
form. Any unknown parameters in B can be replaced by estimates. In complic- 
ated problems in which theoretical evaluation of B is difficult, simulation of 
data under the null hypothesis can be used to evaluate B approximately. 
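
The simulation route can be sketched as follows for an assumed simple case, the likelihood ratio test of a specified rate for exponentially distributed observations, in which there are no nuisance parameters and the statistic has one degree of freedom. The function names and the number of simulations are illustrative.

```python
import math
import random


def lr_statistic(sample, theta0):
    """Likelihood ratio statistic for the rate theta0 of an exponential sample:
    2n{theta0*ybar - 1 - log(theta0*ybar)}."""
    n = len(sample)
    r = theta0 * sum(sample) / n
    return 2.0 * n * (r - 1.0 - math.log(r))


def simulated_bartlett_factor(n, theta0, n_sim, rng, d=1):
    """Estimate 1 + B/n as the simulated null mean of the statistic divided by
    its degrees of freedom d."""
    total = 0.0
    for _ in range(n_sim):
        sample = [rng.expovariate(theta0) for _ in range(n)]
        total += lr_statistic(sample, theta0)
    return (total / n_sim) / d
```

Dividing an observed statistic by the estimated factor gives the adjusted statistic. In this particular case the leading-order factor is known to be $1 + 1/(6n)$, which provides a check on the simulation.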

The factor $(1 + B/n)$ is called a Bartlett correction. When $\psi$ is
one-dimensional the two-sided confidence limits for $\psi$ produced from $W_L'$ do indeed
have improved coverage properties but unfortunately the upper
and lower limits are not, in general, separately improved.


Notes 6 


Section 6.2. A proof of the asymptotic results developed informally here
requires regularity conditions ensuring first that the root of the maximum
likelihood estimating equation with the largest log likelihood converges in probability
to $\theta$. Next conditions are needed to justify ignoring the third-order terms in the
series expansions involved. Finally the Central Limit Theorem for the score and
the convergence in probability of the ratio of observed to expected information
have to be proved. See van der Vaart (1998) for an elegant and mathematically
careful account largely dealing with independent and identically distributed 
random variables. As has been mentioned a number of times in the text, a key 
issue in applying these ideas is whether, in particular, the distributional results 
are an adequate approximation in the particular application under study. 

A standard notation in analysis is to write $\{a_n\} = O(n^b)$ as $n \to \infty$ to mean
that $a_n/n^b$ is bounded and to write $a_n = o(n^b)$ to mean $a_n/n^b \to 0$. The
corresponding notation in probability theory is to write for random variables $A_n$
that $A_n = O_p(n^b)$ or $A_n = o_p(n^b)$ to mean respectively that $A_n/n^b$ is bounded
with very high probability and that it tends to zero in probability.
Sometimes one says in the former case that $a_n$ or $A_n$ is of the order of
magnitude $n^b$. This is a use of the term different from that in the physical sciences
where often an order of magnitude is a power of 10. If $A_n - a = o_p(1)$ then $A_n$
converges in probability to $a$.

In discussing distributional approximations, in particular normal
approximations, the terminology is that $V_n$ is asymptotically normal with mean and
variance respectively $\mu_n$ and $\sigma_n^2$ if the distribution function of $(V_n - \mu_n)/\sigma_n$
tends to the distribution function of the standard normal distribution. Note that
$\mu_n$ and $\sigma_n$ refer to the approximating normal distribution and are not necessarily
approximations to the mean and variance of $V_n$.

The relation and apparent discrepancy between Bayesian and frequentist tests 
was outlined by Jeffreys (1961, first edition 1939) and developed in detail by 
Lindley (1957). A resolution broadly along the lines given here was sketched 
by Bartlett (1957); see, also, Cox and Hinkley (1974). 


Section 6.3. Another interpretation of the transformation results for the score 
and the information matrix is that they play the roles respectively of a contra- 
variant vector and a second-order covariant tensor and the latter could be 
taken as the metric tensor in a Riemannian geometry. Most effort on geometrical 
aspects of asymptotic theory has emphasized the more modern coordinate-free 
approach. See Dawid (1975), Amari (1985) and Murray and Rice (1993). For a 
broad review, see Barndorff-Nielsen et al. (1986) and for a development of the 
Riemannian approach McCullagh and Cox (1986). 


Section 6.5. For more on Bayes factors, see Kass and Raftery (1995) and 
Johnson (2005). 


Section 6.8. Asymptotic tests based strongly on the method of Lagrange mul- 
tipliers are developed systematically by Aitchison and Silvey (1958) and Silvey 
(1959). The discussion here is based on joint work with N. Wermuth and
G. Marchetti. Covariance selection models involving zero constraints in
concentration matrices were introduced by Dempster (1972). Models with zero
constraints on covariances are a special class of linear in covariance structures 
(Anderson, 1973). Both are of current interest in connection with the study of 
various kinds of Markov independence graphs. 

The discussion of a covering model as a systematic device for generating close 
approximations to maximum likelihood estimates follows Cox and Wermuth 
(1990); R. A. Fisher noted that one step of an iterative scheme for maximum 
likelihood estimates recovered all information in a large-sample sense. 


Section 6.10. Nelder and Mead (1965) give a widely used algorithm based on 
simplex search. Lange (2000) gives a general account of numerical-analytic 
problems in statistics. For the EM algorithm see Dempster et al. (1977) and for 
a review of developments Meng and van Dyk (1997). Sundberg (1974) develops 
formulae for incomplete data from an exponential family model. Efron (1979) 
introduces the nonparametric bootstrap; see Davison and Hinkley (1997) for 
a very thorough account. Markov chain Monte Carlo methods are discussed 
from somewhat different perspectives by Liu (2002) and by Robert and Casella 
(2004). For a wide-ranging account of simulation methods, see Ripley (1987). 
For an example of the envelope method of computing a profile likelihood, see 
Cox and Medley (1989). 


Section 6.11. Higher-order asymptotic theory of likelihood-based statistics 
was initially studied by M. S. Bartlett beginning with a modification of the 
likelihood ratio statistic (Bartlett, 1937) followed by the score statistic (Bartlett, 
1953a, 1953b). Early work explicitly or implicitly used Edgeworth expansions, 
modifications of the normal distribution in terms of orthogonal polynomials 
derived via Taylor expansion of a moment generating function. Nearly 30 years 
elapsed between the introduction (Daniels, 1954) of saddle-point or tilted 
Edgeworth series and their systematic use in inferential problems. Butler (2007) 
gives a comprehensive account of the saddle-point method and its applications. 
Efron and Hinkley (1978) establish the superiority of observed over expected 
information for inference by appeal to conditionality arguments. For a gen- 
eral account, see Barndorff-Nielsen and Cox (1994) and for a wide-ranging 
more recent review Reid (2003). The book by Brazzale et al. (2007) deals both 
with the theory and the implementation of these ideas. This latter work is in a 
likelihood-based framework. It may be contrasted with a long series of papers 
collected in book form by Akahira and Takeuchi (2003) that are very directly 
in a Neyman-Pearson setting. 


7 


Further aspects of maximum likelihood 


Summary. Maximum likelihood estimation and related procedures provide 
effective solutions for a wide range of problems. There can, however, be dif- 
ficulties leading at worst to inappropriate procedures with properties far from 
those sketched above. Some of the difficulties are in a sense mathematical 
pathologies but others have serious statistical implications. The first part of the 
chapter reviews the main possibilities for anomalous behaviour. For illustration 
relatively simple examples are used, often with a single unknown parameter. 
The second part of the chapter describes some modifications of the likelihood 
function that sometimes allow escape from these difficulties. 


7.1 Multimodal likelihoods 


In some limited cases, notably connected with exponential families, convexity 
arguments can be used to show that the log likelihood has a unique maximum. 
More commonly, however, there is at least the possibility of multiple maxima 
and saddle-points in the log likelihood surface. See Note 7.1. 

There are a number of implications. First, proofs of the convergence of 
algorithms are of limited comfort in that convergence to a maximum that is 
in actuality not the overall maximum of the likelihood is unhelpful or worse. 
Convergence to the global maximum is nearly always required for correct inter- 
pretation. When there are two or more local maxima giving similar values to 
the log likelihood, it will in principle be desirable to know them all; the natural 
confidence set may consist of disjoint intervals surrounding these local maxima. 
This is probably an unusual situation, however, and the more common situation 
is that the global maximum is dominant. 

The discussion of Sections 6.4-6.7 of the properties of profile likelihood 
hinges only on the estimate of the nuisance parameter $\hat{\lambda}_\psi$ being a stationary 
point of the log likelihood given $\psi$ and on the formal information matrix $j_{\lambda\lambda}$ 
being positive definite. There is thus the disturbing possibility that the profile 
log likelihood appears well-behaved but that it is based on defective estimates 
of the nuisance parameters. 

The argument of Section 6.8 can sometimes be used to obviate some of these 
difficulties. It starts from a set of estimates with simple well-understood proper- 
ties, corresponding to a saturated model, for example. The required maximum 
likelihood estimates under the restricted model are then expressed as small 
corrections to some initial simple estimates. Typically the corrections, being 
$O_p(1/\sqrt{n})$ in magnitude, should be of the order of the standard error of the 
initial estimates, and small compared with that if the initial estimates are of high 
efficiency. There are two consequences. If the correction is small, then it is quite 
likely that a point close to the global maximum of the likelihood has been reached. If 
the correction is large, the proposed model is probably a bad fit to the data. This 
rather imprecise argument suggests that anomalous forms of log likelihood are 
likely to be of most concern when the model is a bad fit to the data or when the 
amount of information in the data is very small. 

It is possible for the log likelihood to be singular, i.e., to become unbounded 
in a potentially misleading way. We give the simplest example of this. 


Example 7.1. An unbounded likelihood. Suppose that $Y_1, \ldots, Y_n$ are independently 
and identically distributed with unknown mean $\mu$ and density 

$$(1 - b)\,\phi(y - \mu) + b\lambda^{-1}\phi\{(y - \mu)/\lambda\}, \tag{7.1}$$

where $\phi(.)$ is the standard normal density, $\lambda$ is a nuisance parameter and b is 
a known small constant, for example $10^{-6}$. Because for most combinations of 
the parameters the second term in the density is negligible, the log likelihood 
will for many values of $\lambda$ have a maximum close to $\mu = \bar{y}$, and $\bar{y}$, considered 
as an estimate of $\mu$, will enjoy the usual properties. But for $\mu = y_k$ for any k 
the log likelihood will be unbounded as $\lambda \to 0$. Essentially a discrete atom, the 
limiting form, gives an indefinitely larger log likelihood than any continuous 
distribution. To some extent the example is pathological. A partial escape route 
at a fundamental level can be provided as follows. All data are discrete and the 
density is used to give approximations for small bin widths (grouping intervals) 
h. For any value of $\mu$ for which an observation is reasonably central to a bin 
and for which $\lambda$ is small compared with h one normal component in the model 
attaches a probability of 1 to that observation and hence the model attaches 
likelihood b, namely the weight of the normal component in question. Hence 
the ratio of the likelihood achieved at $\mu = y_k$ and small $\lambda$ to that achieved at 
$\mu = \bar{y}$ with any $\lambda$ not too small is 


$$\frac{b\sqrt{2\pi}\,\exp\{-n(y_k - \bar{y})^2/2\}}{h(1 - b)}. \tag{7.2}$$

This ratio is of interest because it has to be greater than 1 for the optimum 
likelihood around $\mu = y_k$ to dominate the likelihood at $\mu = \bar{y}$. Unless $b/h$ 
is large and $y_k$ is close to $\bar{y}$ the ratio will be less than 1. The possibility of 
anomalous behaviour with misleading numerical consequences is much more 
serious with the continuous version of the likelihood. 
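
The behaviour of the continuous likelihood can be seen numerically. The following is a minimal sketch that evaluates the log likelihood from the density (7.1) at $\mu = \bar{y}$ and at $\mu$ equal to one observation for decreasing $\lambda$. The data values, the choice $b = 10^{-6}$ and the function names are illustrative assumptions and are not taken from the text.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def log_lik(y, mu, lam, b=1e-6):
    """Log likelihood for the two-component density (7.1)."""
    return sum(
        math.log((1.0 - b) * phi(v - mu) + (b / lam) * phi((v - mu) / lam))
        for v in y
    )

y = [-1.3, -0.4, 0.1, 0.6, 1.5]      # illustrative data, not from the text
ybar = sum(y) / len(y)

print("mu = ybar, lambda = 1:", log_lik(y, ybar, 1.0))
for lam in (1e-5, 1e-50, 1e-100):
    # With mu equal to an observation the log likelihood grows like -log(lambda).
    print("mu = y[0], lambda =", lam, ":", log_lik(y, y[0], lam))
```

With these illustrative numbers the value at $\mu = y_1$ and a moderately small $\lambda$ lies below that at $\mu = \bar{y}$; only for extremely small $\lambda$ does the singularity take over, so a numerical search may or may not encounter it.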


7.2 Irregular form 


A key step in the discussion of even the simplest properties is that the maximum 
likelihood estimating equation is unbiased, i.e., that 


$$E\{\nabla l(\theta; Y); \theta\} = 0. \tag{7.3}$$


This was proved by differentiating under the integral sign the normalizing 
equation 


$$\int f_Y(y; \theta)\,dy = 1. \tag{7.4}$$


If this differentiation is not valid there is an immediate difficulty and we may 
call the problem irregular. All the properties of maximum likelihood estimates 
sketched above are now suspect. The most common reason for failure is that 
the range of integration effectively depends on $\theta$. 

The simplest example of this is provided by a uniform distribution with 
unknown range. 


Example 7.2. Uniform distribution. Suppose that $Y_1, \ldots, Y_n$ are independently 
distributed with a uniform distribution on $(0, \theta)$, i.e., have constant density $1/\theta$ 
in that range and zero elsewhere. The normalizing condition for n independent 
observations is, integrating in $R^n$, 

$$\int_0^\theta (1/\theta^n)\,dy = 1. \tag{7.5}$$


If this is differentiated with respect to $\theta$ there is a contribution not only from 
differentiating $1/\theta$ but also one from the upper limit of the integral and the key 
property fails. 

Direct examination of the likelihood shows it to be discontinuous, being zero 
if $\theta < y_{(n)}$, the largest observation, and equal to $1/\theta^n$ otherwise. Thus $y_{(n)}$ is 
the maximum likelihood estimate and is also the minimal sufficient statistic for 
$\theta$. Its properties can be studied directly by noting that 


$$P(Y_{(n)} \le z) = \prod P(Y_k \le z) = (z/\theta)^n. \tag{7.6}$$

For inference about $\theta$, consider the pivot $n(\theta - Y_{(n)})/\theta$ for which 

$$P\{n(\theta - Y_{(n)})/\theta < t\} = 1 - (1 - t/n)^n, \tag{7.7}$$


so that as n increases the pivot is asymptotically exponentially distributed. In 
this situation the maximum likelihood estimate is a sure lower limit for $\theta$, the 
asymptotic distribution is not Gaussian and, particularly significantly, the errors 
of estimation are $O_p(1/n)$ not $O_p(1/\sqrt{n})$. An upper confidence limit for $\theta$ is 
easily calculated from the exact or from the asymptotic pivotal distribution. 
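
The calculation of the upper limit can be sketched as follows. The exact limit inverts $P(Y_{(n)} \le z) = (z/\theta)^n$ and the asymptotic limit treats the pivot $n(\theta - Y_{(n)})/\theta$ as having a unit exponential distribution. The function names, the simulation settings and the seed are illustrative assumptions.

```python
import math
import random

def exact_upper_limit(y, c):
    """Upper 1 - c limit for theta from P(Y_(n) <= z) = (z/theta)^n."""
    return max(y) / c ** (1.0 / len(y))

def asymptotic_upper_limit(y, c):
    """Upper 1 - c limit treating n(theta - Y_(n))/theta as unit exponential."""
    n = len(y)
    t = -math.log(c)            # exponential upper c point; requires t < n
    return max(y) / (1.0 - t / n)

def coverage(limit, theta=2.0, n=20, c=0.05, reps=20000, seed=1):
    """Proportion of simulated samples whose upper limit is at least theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        y = [rng.uniform(0.0, theta) for _ in range(n)]
        hits += limit(y, c) >= theta
    return hits / reps

print(coverage(exact_upper_limit), coverage(asymptotic_upper_limit))
```

The exact limit should give coverage close to the nominal 0.95 up to simulation error, whereas for n as small as 20 the asymptotic limit is somewhat conservative.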


Insight into the general problem can be obtained from the following simple 
generalization of Example 7.2. 


Example 7.3. Densities with power-law contact. Suppose that $Y_1, \ldots, Y_n$ are 
independently distributed with the density $(a + 1)(\theta - y)^a/\theta^{a+1}$, where a is 
a known constant. For Example 7.2, a = 0. As a varies the behaviour of the 
density near the critical end-point $\theta$ changes. 

The likelihood has a change of behaviour, although not necessarily a local 
maximum, at the largest observation. A slight generalization of the argument 
used above shows that the pivot 

$$n^{1/(a+1)}(\theta - Y_{(n)})/\theta \tag{7.8}$$

has a limiting Weibull distribution with distribution function $1 - \exp(-t^{a+1})$. 
Thus for $-1/2 < a < 1$ inference about $\theta$ is possible from the maximum 
observation at an asymptotic rate faster than $1/\sqrt{n}$. 
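
The limiting distribution can be checked directly. The following is a short derivation sketch, assuming that the density is supported on $(0, \theta)$ with $a > -1$; the symbol u, denoting a point in (0, 1), is introduced here for illustration.

```latex
% One observation, for 0 < u < 1:
P\{(\theta - Y)/\theta > u\}
  = \int_0^{\theta(1-u)} (a+1)(\theta - y)^a \theta^{-(a+1)} \, dy
  = 1 - u^{a+1}.
% Largest of n independent observations, with u = t n^{-1/(a+1)}:
P\{n^{1/(a+1)}(\theta - Y_{(n)})/\theta > t\}
  = (1 - t^{a+1}/n)^n \to \exp(-t^{a+1}).
```

The complement of the last expression is the Weibull distribution function quoted for the pivot.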

Consider next the formal properties of the maximum likelihood estimating 
equation. We differentiate the normalizing equation 


$$\int_0^\theta (a + 1)(\theta - y)^a\theta^{-(a+1)}\,dy = 1 \tag{7.9}$$


obtaining from the upper limit a contribution equal to the argument of the 
integrand there and thus zero if $a > 0$ and nonzero (and possibly unbounded) 
if $a \le 0$. 

For $0 < a \le 1$ the score statistic for a single observation is 


$$\partial \log f(Y, \theta)/\partial\theta = a/(\theta - Y) - (a + 1)/\theta \tag{7.10}$$


and has zero mean and infinite variance; an estimate based on the largest 
observation is preferable. 

For $a > 1$ the normalizing condition for the density can be differentiated 
twice under the integral sign, Fisher's identity relating the expected information 
to the variance of the score holds and the problem is regular. 


The relevance of the extreme observation for asymptotic inference thus 
depends on the level of contact of the density with the y-axis at the terminal 
point. This conclusion applies also to more complicated problems with more 
parameters. An example is the displaced exponential distribution with density 


$$\rho \exp\{-\rho(y - \theta)\} \tag{7.11}$$


for $y > \theta$ and zero otherwise, which has a discontinuity in density at $y = \theta$. 
The sufficient statistics in this case are the smallest observation $y_{(1)}$ and the 
mean. Estimation of $\theta$ by $y_{(1)}$ has error $O_p(1/n)$ and estimation of $\rho$ with error 
$O_p(1/\sqrt{n})$ can proceed as if $\theta$ is known at least to the first order of asymptotic 
theory. The estimation of the displaced Weibull distribution with density 


$$\rho\gamma\{\rho(y - \theta)\}^{\gamma-1} \exp[-\{\rho(y - \theta)\}^\gamma] \tag{7.12}$$


for $y > \theta$ illustrates the richer possibilities of Example 7.3. 

There is an important additional aspect about the application of these and 
similar results already discussed in a slightly different context in Example 7.1. 
The observations are treated as continuously distributed, whereas all real data 
are essentially recorded on a discrete scale, often in groups or bins of a par- 
ticular width. For the asymptotic results to be used to obtain approximations 
to the distribution of extremes, it is, as in the previous example, important that 
the grouping interval in the relevant range is chosen so that rounding errors 
are unimportant relative to the intrinsic random variability of the continuous 
variables in the region of concern. If this is not the case, it may be more relevant 
to consider an asymptotic argument in which the grouping interval is fixed as 
n increases. Then with high probability the grouping interval containing the 
terminal point of support will be identified and estimation of the position of 
the maximum within that interval can be shown to be an essentially regular 
problem leading to a standard error that is $O(1/\sqrt{n})$. 

While the analytical requirements justifying the expansions of the log like- 
lihood underlying the theory of Chapter 6 are mild, they can fail in ways less 
extreme than those illustrated in Examples 7.2 and 7.3. For example the log like- 
lihood corresponding to n independent observations from the Laplace density 
$\exp(-|y - \theta|)/2$ leads to a score function of 


$$\sum \mathrm{sgn}(y_k - \theta). \tag{7.13}$$


It follows that the maximum likelihood estimate is the median, although if n 
is odd the score is not differentiable there. The second derivative of the log 
likelihood is not defined, but it can be shown that the variance of the score does 
determine the asymptotic variance of the maximum likelihood estimate. 
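
As a small illustration, the following sketch evaluates the Laplace log likelihood and its derivative with respect to $\theta$, the sum of $\mathrm{sgn}(y_k - \theta)$, on either side of the sample median. The data and the function names are illustrative assumptions.

```python
import math
import statistics

def laplace_loglik(y, theta):
    """Log likelihood for the Laplace density exp(-|y - theta|)/2."""
    return -sum(abs(v - theta) for v in y) - len(y) * math.log(2.0)

def laplace_score(y, theta):
    """Derivative of the log likelihood where it exists: sum of sgn(y_k - theta)."""
    return sum((v > theta) - (v < theta) for v in y)

y = [0.2, 1.7, -0.5, 3.1, 0.9]          # illustrative data with n odd
med = statistics.median(y)
for theta in (0.8, med, 1.0):
    print(theta, laplace_score(y, theta), laplace_loglik(y, theta))
```

The score changes sign by a jump at the median, so the maximum is located by the sign change rather than by a zero of a smooth function.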
A more subtle example of failure is provided by a simple time series problem. 


Example 7.4. Model of hidden periodicity. Suppose that $Y_1, \ldots, Y_n$ are 
independently normally distributed with variance $\sigma^2$ and with 


$$E(Y_k) = \mu + \alpha \cos(k\omega) + \beta \sin(k\omega), \tag{7.14}$$


where $(\mu, \alpha, \beta, \omega, \sigma^2)$ are unknown. A natural interpretation is to think of the 
observations as equally spaced in time. It is a helpful simplification to restrict 
$\omega$ to the values 


$$\omega_p = 2\pi p/n \quad (p = 1, 2, \ldots, [n/2]), \tag{7.15}$$


where [x] denotes the integer part of x. We denote the true value of $\omega$ holding 
in (7.14) by $\omega_q$. 
The finite Fourier transform of the data is defined by 


$$\tilde{Y}(\omega_p) = \sqrt{2/n}\,\sum Y_k e^{ik\omega_p} = A(\omega_p) + iB(\omega_p). \tag{7.16}$$


Because of the special properties of the sequence $\{\omega_p\}$ the transformation 
from $Y_1, \ldots, Y_n$ to $\sum Y_k/\sqrt{n}, A(\omega_1), B(\omega_1), \ldots$ is orthogonal and hence the 
transformed variables are independently normally distributed with variance $\sigma^2$. 


Further, if the linear model (7.14) holds with $\omega = \omega_q$ known, it follows that: 


• for $p \ne q$, the finite Fourier transform has expectation zero; 

• the least squares estimates of $\alpha$ and $\beta$ are $\sqrt{2/n}$ times the real and 
imaginary parts of $\tilde{Y}(\omega_q)$, the expectations of which are therefore $\sqrt{n/2}\,\alpha$ 
and $\sqrt{n/2}\,\beta$; 

• the residual sum of squares is the sum of squares of all finite Fourier 
transform components except those at $\omega_q$. 


Now suppose that $\omega$ is unknown and consider its profile likelihood after 
estimating $(\mu, \alpha, \beta, \sigma^2)$. Equivalently, because (7.14) for fixed $\omega$ is a 
normal-theory linear model, we may use minus the residual sum of squares. From the 
points listed above it follows that the residual sum of squares varies randomly 
across the values of $\omega$ except at $\omega = \omega_q$, where the residual sum of squares is 
much smaller. The dip in values, corresponding to a peak in the profile 
likelihood, is isolated at one point. Even though the log likelihood is differentiable 
as a function of the continuous variable $\omega$ the fluctuations in its value are so 
rapid and extreme that two terms of a Taylor expansion about the maximum 
likelihood point are totally inappropriate. 

In a formal asymptotic argument in which $(\alpha, \beta)$ are $O(1/\sqrt{n})$, and so on the 
borderline of detectability, the asymptotic distribution of $\hat{\omega}$ is a mixture of an 
atom at the true value $\omega_q$ and a uniform distribution on $(0, \pi/2)$. The uniform 
distribution arises from identifying the maximum at the wrong point. 
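
The isolated dip in the residual sum of squares can be illustrated numerically. The sketch below computes the finite Fourier transform at each $\omega_p$ and the corresponding residual sum of squares after fitting a mean and one sinusoid. The simulated data, the parameter values, the seed and the restriction to $0 < p < n/2$ (which avoids the special case $\omega = \pi$) are assumptions made for illustration.

```python
import cmath
import math
import random

def finite_fourier(y, p):
    """Finite Fourier transform (7.16) at omega_p = 2*pi*p/n."""
    n = len(y)
    w = 2.0 * math.pi * p / n
    return math.sqrt(2.0 / n) * sum(
        v * cmath.exp(1j * k * w) for k, v in enumerate(y, start=1)
    )

def residual_ss(y):
    """Residual sum of squares from fitting (7.14) at each omega_p, 0 < p < n/2."""
    n = len(y)
    ybar = sum(y) / n
    total = sum((v - ybar) ** 2 for v in y)
    return {p: total - abs(finite_fourier(y, p)) ** 2
            for p in range(1, (n - 1) // 2 + 1)}

rng = random.Random(0)
n, q = 64, 5                               # illustrative sample size and true index
w_q = 2.0 * math.pi * q / n
y = [1.0 + 2.0 * math.cos(k * w_q) + 1.0 * math.sin(k * w_q) + rng.gauss(0.0, 1.0)
     for k in range(1, n + 1)]
rss = residual_ss(y)
print(min(rss, key=rss.get))               # index p at which the dip occurs
```

Because the sinusoids at distinct $\omega_p$ are orthogonal, the reduction in the sum of squares at $\omega_p$ is just the squared modulus of the transform there, which is why the values at neighbouring frequencies carry no information about the position of the dip.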


7.3 Singular information matrix 


A quite different kind of anomalous behaviour occurs if the score statistic is 
defined but is singular. In particular, for a scalar parameter the score function 
may be identically zero for special values of the parameter. It can be shown that 
this can happen only on rather exceptional sets in the parameter space but if 
one of these points corresponds to a null hypothesis of interest, the singularity 
is of statistical concern; see Note 7.3. 


Example 7.5. A special nonlinear regression. Suppose that $Y_1, \ldots, Y_n$ are 
independently normally distributed with unit variance and that for constants 
$z_1, \ldots, z_n$ not all equal we have one of the models 

$$E(Y_j) = \exp(\theta z_j) - 1 - \theta z_j \tag{7.17}$$

or 

$$E(Y_j) = \exp(\theta z_j) - 1 - \theta z_j - (\theta z_j)^2/2. \tag{7.18}$$


Consider the null hypothesis $\theta = 0$. For the former model the contribution 
to the score statistic from $Y_j$ is 


$$-(\partial/\partial\theta)\{Y_j - \exp(\theta z_j) + 1 + \theta z_j\}^2/2 \tag{7.19}$$


and both this and the corresponding contribution to the information are 
identically zero at $\theta = 0$. 

It is clear informally that, to extend the initial discussion of maximum like- 
lihood estimation and testing to such situations, the previous series expansions 
leading to likelihood derivatives must be continued to enough terms to obtain 
nondegenerate results. Details will not be given here. Note, however, that locally 
near $\theta = 0$ the two forms for $E(Y_j)$ set out above are essentially $(\theta z_j)^2/2$ 
and $(\theta z_j)^3/6$. This suggests that in the former case $\theta^2$ will be estimated to a 
certain precision, but that the sign of $\theta$ will be estimated with lower precision, 
whereas that situation will not arise in the second model, $\theta^3$, and hence the sign 
of $\theta$, being estimable. 
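
The vanishing of the score and information at the null value can be verified directly for the first model. The following minimal sketch evaluates both for unit-variance normal errors; the data, the constants $z_j$ and the function names are illustrative assumptions.

```python
import math

def mean_717(theta, z):
    """Mean function of model (7.17)."""
    return math.exp(theta * z) - 1.0 - theta * z

def dmean_717(theta, z):
    """Derivative of the mean function (7.17) with respect to theta."""
    return z * math.exp(theta * z) - z

def score_and_information(theta, y, z):
    """Score and expected information for unit-variance normal observations."""
    score = sum((yj - mean_717(theta, zj)) * dmean_717(theta, zj)
                for yj, zj in zip(y, z))
    info = sum(dmean_717(theta, zj) ** 2 for zj in z)
    return score, info

y = [0.3, -1.1, 0.8, 2.0]        # illustrative values only
z = [-1.0, 0.5, 1.0, 2.0]
print(score_and_information(0.0, y, z))   # both components vanish at theta = 0
print(score_and_information(0.1, y, z))   # nonzero away from the null value
```

Both quantities vanish at zero whatever the data, because the derivative of the mean function is zero there for every $z_j$.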


A more realistic example of the same issue is provided by simple models of 
informative nonresponse. 

Example 7.6. Informative nonresponse. Suppose that $Y_1, \ldots, Y_n$ are independently 
normally distributed with mean $\mu$ and variance $\sigma^2$. For each observation 
there is a possibility that the corresponding value of y cannot be observed. Note 
that in this formulation the individual study object is always observed, it being 
only the value of Y that may be missing. If the probability of being unobserved 
is independent of the value of y the analysis proceeds with the likelihood of the 
observed values; the values of y are missing completely at random. Suppose, 
however, that given Y = y the probability that y is observed has the form 


$$\exp\{H(\alpha_0 + \alpha_1(y - \mu)/\sigma)\}, \tag{7.20}$$


where $\alpha_0, \alpha_1$ are unknown parameters and H(.) is a known function, for example 
corresponding to a logistic regression of the probability of being observed as a 
function of y. Missing completely at random corresponds to $\alpha_1 = 0$ and interest 
may well focus on that null hypothesis and more generally on small values 
of $\alpha_1$. 

We therefore consider two random variables (R, Y), where R is binary, taking 
values 0 and 1. The value of Y is observed if and only if R = 1. The contribution 
of a single individual to the log likelihood is thus 


$$-r \log \sigma - r(y - \mu)^2/(2\sigma^2) + rH\{\alpha_0 + \alpha_1(y - \mu)/\sigma\} 
+ (1 - r) \log E[1 - \exp\{H(\alpha_0 + \alpha_1(Y - \mu)/\sigma)\}]. \tag{7.21}$$


The last term is the overall probability that Y is not observed. The primary 
objective in this formulation would typically be the estimation of $\mu$ allowing 
for the selectively missing data values; in a more realistic setting the target 
would be a regression coefficient assessing the dependence of Y on explanatory 
variables. 

Expansion of the log likelihood in powers of $\alpha_1$ shows that the log likelihood 
is determined by 


$$n_c, \quad \Sigma_c(y_k - \mu), \quad \Sigma_c(y_k - \mu)^2, \tag{7.22}$$


and so on, where $n_c$ is the number of complete observations and $\Sigma_c$ denotes 
sum over the complete observations. That is, the relevant functions of the data 
are the proportion of complete observations and the first few sample moments 
of the complete observations. 


If the complete system variance $\sigma^2$ is assumed known, estimation may be 
based on $n_c$ and the first two moments, thereby estimating the three unknown 
parameters $(\mu, \alpha_0, \alpha_1)$. If $\sigma^2$ is regarded as unknown, then the sample third 
moment is needed also. 

The general difficulty with this situation can be seen in various ways. One is 
that when, for example, $\sigma^2$ is assumed known, information about the missing 
structure can be inferred by comparing the observed sample variance with the 
known value and ascribing discrepancies to the missingness. Not only will such 
comparisons be of low precision but more critically they are very sensitive to the 
accuracy of the presumed value of variance. When the variance is unknown the 
comparison of interest involves the third moment. Any evidence of a nonvan- 
ishing third moment is ascribed to the missingness and this makes the inference 
very sensitive to the assumed symmetry of the underlying distribution. 

More technically the information matrix is singular at the point $\alpha_1 = 0$. 
Detailed calculations not given here show a parallel with the simpler regression 
problem discussed above. For example when $\sigma^2$ is unknown, expansion yields 
the approximate estimating equation 


$$\hat{\alpha}_1^3 = \{H'''(\hat{\alpha}_0)\}^{-1}\hat{\gamma}_3, \tag{7.23}$$


where $\hat{\gamma}_3$ is the standardized third cumulant, i.e., third cumulant or moment 
about the mean divided by the cube of the standard deviation. It follows that 
fluctuations in $\hat{\alpha}_1$ are $O_p(n^{-1/6})$; even apart from this the procedure is fragile 
for the reason given above. 


7.4 Failure of model 


An important possibility is that the data under analysis are derived from a 
probability density g(.) that is not a member of the family $f(y; \theta)$ originally 
chosen to specify the model. Note that since all models are idealizations the 
empirical content of this possibility is that the data may be seriously incon- 
sistent with the assumed model and that although a different model is to be 
preferred, it is fruitful to examine the consequences for the fitting of the original 
family. 

We assume that for the given g there is a value of $\theta$ in the relevant parameter 
space such that 


$$E_g\{\nabla l(\theta)\}_{\theta=\theta_g} = 0. \tag{7.24}$$


Here $E_g$ denotes expectation with respect to g. 
The value $\theta_g$ minimizes 


$$\int g(y) \log\{g(y)/f_Y(y; \theta)\}\,dy. \tag{7.25}$$


This, the Kullback—Leibler divergence, arose earlier in Section 5.9 in connection 
with reference priors. 

A development of (7.24) is to show that under the model g the distribution of 
$\hat{\theta}$ is asymptotically normal with mean $\theta_g$ and covariance matrix 


$$[E_g\{\nabla\nabla^T \log f_Y(y; \theta)\}]^{-1}\,\mathrm{cov}_g\{\nabla \log f_Y(y; \theta)\}\,[E_g\{\nabla\nabla^T \log f_Y(y; \theta)\}]^{-1}, \tag{7.26}$$


which is sometimes called the sandwich formula. 
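
For a scalar parameter the sandwich formula reduces to a ratio that is easily computed. The sketch below fits a Poisson model with mean $\theta$ to counts that may be overdispersed and compares the model-based variance of $\hat{\theta}$ with the sandwich form, replacing the expectations under g by sample averages. The data, the function name and the use of plug-in averages are illustrative assumptions.

```python
def poisson_fit_variances(y):
    """ML estimate of a Poisson mean with model-based and sandwich variances."""
    n = len(y)
    theta_hat = sum(y) / n                           # ML estimate of the mean
    # Derivatives of log f = y*log(theta) - theta - log(y!) at theta_hat.
    scores = [v / theta_hat - 1.0 for v in y]        # first derivative
    hessians = [-v / theta_hat ** 2 for v in y]      # second derivative
    a = -sum(hessians) / n                           # average of minus second derivative
    b = sum(s * s for s in scores) / n               # average squared score
    model_based = 1.0 / (n * a)                      # inverse information
    sandwich = b / (n * a * a)                       # scalar form of (7.26)
    return theta_hat, model_based, sandwich

y = [0, 0, 1, 1, 2, 8, 9, 11]                        # illustrative overdispersed counts
print(poisson_fit_variances(y))
```

In this case the sandwich variance is the sample variance (with divisor n) divided by n, whereas the model-based value is the sample mean divided by n; the two agree only when the variance of the data is close to the mean.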

The most immediate application of these formulae is to develop likelihood 
ratio tests for what are sometimes called separate families of hypotheses. Suppose 
that we regard consistency with the family $f_Y(y; \theta)$ as a null hypothesis 
and either a single distribution $g_Y(y)$ or more realistically a family $g_Y(y; \omega)$ 
as the alternative. It is assumed that for a given $\theta$ there is no value of $\omega$ that 
exactly agrees with $f_Y(y; \theta)$ but that in some sense there are values for which 
there is enough similarity that discrimination between them is difficult with the 
data under analysis. The likelihood ratio statistic 


$$l_g(\hat{\omega}) - l_f(\hat{\theta}) \tag{7.27}$$


can take negative as well as positive values, unlike the case where the null 
hypothesis model is nested within the alternatives. It is clear therefore that the 
asymptotic distribution cannot be of the chi-squared form. In fact, it is normal 
with a mean and variance that can be calculated via (7.26). Numerical work 
shows that approach to the limiting distribution is often very slow so that a more 
sensible course in applications will often be to compute the limiting distribution 
by simulation. In principle the distribution should be found conditionally on 
sufficient statistics for the null model and at least approximately these are $\hat{\theta}$ and 
the corresponding observed information. Usually it will be desirable to repeat 
the calculation interchanging the roles of the two models, thereby assessing 
whether both, one but not the other, or neither model is adequate in the respect 
considered. 
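
The simulation approach can be sketched for one particular pair of separate families. Here the exponential distribution is taken as the null family and the log normal as the alternative; this choice, the data, the number of replications and the function names are all illustrative assumptions. For this pair the statistic is unchanged when the data are rescaled, so that simulating at the fitted mean is equivalent to simulating at any other value of the null parameter.

```python
import math
import random

def max_loglik_exponential(y):
    """Maximized log likelihood under the exponential family (null)."""
    n = len(y)
    return -n * math.log(sum(y) / n) - n

def max_loglik_lognormal(y):
    """Maximized log likelihood under the log normal family (alternative)."""
    n = len(y)
    logs = [math.log(v) for v in y]
    m = sum(logs) / n
    s2 = sum((u - m) ** 2 for u in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2.0 * math.pi * s2) - 0.5 * n

def lr_statistic(y):
    """Statistic of the form (7.27); it may be negative."""
    return max_loglik_lognormal(y) - max_loglik_exponential(y)

def simulated_p_value(y, reps=2000, seed=0):
    """Upper-tail p-value with the null distribution found by simulation."""
    rng = random.Random(seed)
    n, mean = len(y), sum(y) / len(y)
    t_obs = lr_statistic(y)
    exceed = sum(
        lr_statistic([rng.expovariate(1.0 / mean) for _ in range(n)]) >= t_obs
        for _ in range(reps)
    )
    return (exceed + 1) / (reps + 1)

y = [0.8, 1.1, 1.3, 0.9, 1.6, 1.2, 0.7, 1.0, 1.4, 1.1]   # illustrative data
print(lr_statistic(y), simulated_p_value(y))
```

Interchanging the roles of the two families would require simulating from the fitted log normal distribution and reversing the sign of the statistic.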

Sometimes more refined calculations are possible, notably when one or 
preferably both models have simple sufficient statistics. 


7.5 Unusual parameter space 


In many problems the parameter space of possible values is an open set in 
some d-dimensional space. That is, the possibility of parameter points on the 
boundary of the space is not of particular interest. For example, probabilities 
are constrained to be in (0, 1) and the possibility of probabilities being exactly 
0 or 1 is usually not of special moment. Parameter spaces in applications are 
commonly formulated as infinite although often in practice there are limits on 
the values that can arise. 

Constraints imposed on the parameter space may be highly specific to a 
particular application. We discuss a simple example which is easily generalized. 


Example 7.7. Integer normal mean. Suppose that $\bar{Y}$ is the mean of n independent 
random variables normally distributed with unknown mean $\mu$ and 
known variance $\sigma_0^2$. Suppose that the unknown mean $\mu$ is constrained to be 
a nonnegative integer; it might, for example, be the number of carbon atoms in 
some moderately complicated molecule, determined indirectly with error. It is 
simplest to proceed via the connection with significance tests. We test in turn 
consistency with each nonnegative integer r, for example by a two-sided test, 
and take all those r consistent with the observed value $\bar{y}$ as a confidence set. 
That is, at level $1 - 2c$ the set of integers in the interval 


$$(\bar{y} - k_c^*\sigma_0/\sqrt{n},\ \bar{y} + k_c^*\sigma_0/\sqrt{n}) \tag{7.28}$$


forms a confidence set. That is, any true integer value of $\mu$ is included with 
appropriate probability. Now if $\sigma_0/\sqrt{n}$ is small compared with 1 there is some 
possibility that the above set is null, i.e., no integer value is consistent with the 
data at the level in question. 

From the perspective of significance testing, this possibility is entirely reas- 
onable. To achieve consistency between the data and the model we would have 
to go to more extreme values of c and if that meant very extreme values then 
inconsistency between data and model would be signalled. That is, a test of 
model adequacy is automatically built into the procedure. From a different 
interpretation of confidence sets as sets associated with data a specified pro- 
portion of which are correct, there is the difficulty that null sets are certainly 
false, if the model is correct. This is a situation in which the significance-testing 
formulation is more constructive in leading to sensible interpretations of data. 
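
The confidence set, including the possibility that it is empty, can be computed as in the following sketch. The multiplier is taken to be the upper c point of the standard normal distribution; the function name and the numerical inputs are illustrative assumptions.

```python
import math
from statistics import NormalDist

def integer_confidence_set(ybar, sigma0, n, c):
    """Nonnegative integers r not rejected by a two-sided test at level 2c."""
    k = NormalDist().inv_cdf(1.0 - c)          # upper c point of the standard normal
    half = k * sigma0 / math.sqrt(n)
    lo = max(0, math.ceil(ybar - half))        # smallest admissible integer
    hi = math.floor(ybar + half)               # largest admissible integer
    return list(range(lo, hi + 1))             # empty when no integer lies inside

print(integer_confidence_set(3.1, 1.0, 100, 0.025))    # a single integer
print(integer_confidence_set(3.4, 1.0, 100, 0.025))    # the null set
print(integer_confidence_set(3.5, 10.0, 100, 0.025))   # several integers
```

An empty result is the signal, discussed above, that no integer mean is consistent with the data at the chosen level.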

These particular formal issues do not arise in a Bayesian treatment. If the 
prior probability that $\mu = r$ is $\pi_r$, then the corresponding posterior probability is 


$$\pi_r \exp(r\gamma\bar{y} - r^2\gamma/2) \Big/ \sum_{s=1} \pi_s \exp(s\gamma\bar{y} - s^2\gamma/2), \tag{7.29}$$


where $\gamma = n/\sigma_0^2$. When $\gamma$ is large the posterior probability will, unless the 
fractional part of $\bar{y}$ is very close to 1/2, concentrate strongly on one value, 
even though that value may, in a sense, be strongly inconsistent with the data. 
Although the Bayesian solution is entirely sensible in the context of the for- 
mulation adopted, the significance-testing approach seems the more fruitful 
in terms of learning from data. In principle the Bayesian approach could, of 
course, be extended to allow some probability on noninteger values. 
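
The concentration of the posterior distribution can be illustrated numerically. The sketch below evaluates posterior probabilities proportional to $\pi_r \exp(r\gamma\bar{y} - r^2\gamma/2)$ with $\gamma = n/\sigma_0^2$, working on the log scale for numerical stability. The uniform prior on the integers 1 to 10, the numerical inputs and the function name are illustrative assumptions.

```python
import math

def integer_posterior(ybar, n, sigma0, prior):
    """Posterior probabilities over the integers in `prior`, a dict {r: pi_r}."""
    gamma = n / sigma0 ** 2
    logw = {r: math.log(p) + r * gamma * ybar - r * r * gamma / 2.0
            for r, p in prior.items() if p > 0}
    top = max(logw.values())                     # subtract the maximum for stability
    w = {r: math.exp(v - top) for r, v in logw.items()}
    total = sum(w.values())
    return {r: v / total for r, v in w.items()}

prior = {r: 1.0 / 10 for r in range(1, 11)}      # illustrative uniform prior on 1..10
post = integer_posterior(ybar=3.4, n=100, sigma0=1.0, prior=prior)
print(max(post, key=post.get), post[3])
```

With these illustrative numbers the standard error of the mean is 0.1, so that no integer lies within 1.96 standard errors of 3.4, and yet the posterior puts almost all its mass on r = 3.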

There are more complicated possibilities, for example when the natural 
parameter space is in mathematical terms not a manifold. 


Example 7.8. Mixture of two normal distributions. Suppose that it is required 
to fit a mixture of two normal distributions, supposed for simplicity to have 
known and equal variances. There are then three parameters, two means $\mu_1, \mu_2$ 
and an unknown mixing probability $\pi$. Some interest might well lie in the 
possibility that one component is enough. This hypothesis can be represented 
as $\pi = 0$, $\mu_1$ arbitrary, or as $\pi = 1$, $\mu_2$ arbitrary, or again as $\mu_1 = \mu_2$, this time 
with $\pi$ arbitrary. The standard asymptotic distribution theory of the likelihood 
ratio statistics does not apply in this case. 


The general issue here is that under the null hypothesis parameters defined 
under the general family become undefined and therefore not estimable. One 
formal possibility is to fix the unidentified parameter at an arbitrary level and 
to construct the corresponding score statistic for the parameter of interest, then 
maximize over the arbitrary level of the unidentified parameter. This argument 
suggests that the distribution of the test statistic under the null hypothesis is that 
of the maximum of a Gaussian process. While the distribution theory can be 
extended to cover such situations, the most satisfactory solution is probably to 
use the likelihood ratio test criterion and to obtain its null hypothesis distribution 
by simulation. 

If the different components have different unknown variance, the apparent 
singularities discussed in Example 7.1 recur. 


7.6 Modified likelihoods 


7.6.1 Preliminaries 


We now consider methods based not on the log likelihood itself but on some 
modification of it. There are three broad reasons for considering some such 
modification. 


• In models in which the dimensionality of the nuisance parameter is large 
direct use of maximum likelihood estimates may be very unsatisfactory and 
then some modification of either the likelihood or the method of estimation 
is unavoidable. 

• Especially in problems with non-Markovian dependency structure, it may 
be difficult or impossible to compute the likelihood in useful form. Some 
time series and many spatial and spatial-temporal models are like that. 

• The likelihood may be very sensitive to aspects of the model that are 
thought particularly unrealistic. 

We begin by discussing the first possibility as the most important, although 
we shall give examples of the other two later. 


7.6.2 Simple example with many parameters 


We give an example where direct use of maximum likelihood estimation is 
misleading. That it is a type of model which the exact theory in the first part of 
the book can handle makes comparative discussion easier. 


Example 7.9. Normal-theory linear model with many parameters. Suppose 
that the $n \times 1$ random vector Y has components that are independently normally 
distributed with variance $\tau$ and that $E(Y) = z\beta$, where z is an $n \times d$ matrix of full 
rank $d < n$ and $\beta$ is a $d \times 1$ vector of unknown parameters. The log likelihood is 


$$-(n/2) \log \tau - (Y - z\beta)^T(Y - z\beta)/(2\tau). \tag{7.30}$$


Now suppose that $\tau$ is the parameter of interest. The maximum likelihood 
estimate of $\beta$ is, for all $\tau$, the least squares estimate $\hat{\beta}$. It follows that 


$$\hat{\tau} = (Y - z\hat{\beta})^T(Y - z\hat{\beta})/n, \tag{7.31}$$


the residual sum of squares divided by sample size, not by $n - d$, the degrees 
of freedom for residual. Direct calculation shows that the expected value of the 
residual sum of squares is $(n - d)\tau$, so that 


$$E(\hat{\tau}) = (1 - d/n)\tau. \tag{7.32}$$


Conventional asymptotic arguments take d to be fixed as n increases; the 
factor $(1 - d/n)$ tends to 1 and it can be verified that all the standard properties 
of maximum likelihood estimates hold. If, however, d is comparable with 
n, then $\hat{\tau}$ is for large n very likely to be appreciably less than $\tau$; systematic 
overfitting has taken place. 

Two quite realistic special cases may be noted. The first is factorial experiments 
in which the number of main effects and interactions fitted may be such as to 
leave relatively few degrees of freedom available for estimating $\tau$. The second 
is a matched pair model in which Y is formed from m pairs $(Y_{k0}, Y_{k1})$ with 


$$E(Y_{k0}) = \lambda_k - \Delta, \quad E(Y_{k1}) = \lambda_k + \Delta. \tag{7.33}$$


Here n = 2m and d = m + 1 so that, by (7.32), $\hat{\tau}$ underestimates $\tau$ by a factor 
of approximately 2. There 
is the special feature in this case that the $\lambda_k$ are incidental parameters, meaning 
that each only affects a limited number of observations, in fact two. Thus the 
supposition underlying maximum likelihood theory that all errors of estimation 
are small is clearly violated. This is the reason in this instance that direct use 
of maximum likelihood is so unsatisfactory. 

The same explanation does not apply so directly to the factorial design where, 
with a suitable parameterization in terms of main effects and interactions of 
various orders, all regression parameters representing these contrasts can be 
estimated with high precision. The same issue is, however, latent within this 
example in that we could, somewhat perversely, reparameterize into a form in 
which the matrix of coefficients defining the linear model is sparse, i.e., contains 
many zeros, and then the same point could arise. 
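
The halving of the variance estimate in the matched pair model can be seen in a small simulation. The sketch below computes the least squares fit in closed form, with each pair effect estimated by the pair mean and the treatment effect by half the mean within-pair difference, and then divides the residual sum of squares by n = 2m and by the residual degrees of freedom m - 1. The simulated pair effects, the parameter values and the seed are illustrative assumptions.

```python
import math
import random

def variance_estimates(pairs):
    """Return (ML estimate, residual-degrees-of-freedom estimate) of the variance."""
    m = len(pairs)
    delta_hat = sum(y1 - y0 for y0, y1 in pairs) / (2.0 * m)
    rss = 0.0
    for y0, y1 in pairs:
        lam_hat = (y0 + y1) / 2.0               # estimate of the pair parameter
        rss += (y0 - lam_hat + delta_hat) ** 2 + (y1 - lam_hat - delta_hat) ** 2
    return rss / (2.0 * m), rss / (m - 1.0)

rng = random.Random(0)
m, tau, delta = 2000, 4.0, 0.5                  # illustrative settings
pairs = []
for _ in range(m):
    lam = rng.uniform(-10.0, 10.0)              # arbitrary pair effect
    pairs.append((lam - delta + rng.gauss(0.0, math.sqrt(tau)),
                  lam + delta + rng.gauss(0.0, math.sqrt(tau))))
ml, unbiased = variance_estimates(pairs)
print(ml, unbiased)
```

The first value should be close to half the true variance however large m is taken, whereas the second should be close to the true variance.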


While appeal to asymptotic maximum likelihood theory is not needed in the 
previous problem, it provides a test case. If maximum likelihood theory is to be 
applied with any confidence to similar more complicated problems an escape 
route must be provided. There are broadly three such routes. The first is to avoid, 
if at all possible, formulations with many nuisance parameters; note, however, 
that this approach would rule out semiparametric formulations as even more 
extreme. The second, and somewhat related, approach is applicable to problems 
where many or all of the individual parameters are likely to take similar values 
and have similar interpretations. This is to use a random effects formulation in 
which, in the matched pair illustration, the representation 


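$$Y_{k0} = \mu - \Delta + \eta_k + \epsilon_{k0}, \quad Y_{k1} = \mu + \Delta + \eta_k + \epsilon_{k1} \tag{7.34}$$

is used. Here the $\eta_k$ and the $\epsilon_{ks}$ are mutually independent
normal random variables of zero mean and variances respectively $\tau_\eta$,
$\tau_\epsilon$ and where $\tau_\epsilon$, previously denoted by $\tau$, is the
parameter of interest.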


Example 7.10. A non-normal illustration. In generalization of the previous 
example, and to show that the issues are in no way specific to the normal
distribution, suppose that $(Y_{l0}, Y_{l1})$ are for $l = 1, \ldots, m$
independently distributed in the exponential family form

$$m(y) \exp\{y\phi - k(\phi)\} \tag{7.35}$$

with canonical parameters $(\phi_{l0}, \phi_{l1})$ such that

$$\phi_{l1} = \lambda_l + \psi, \quad \phi_{l0} = \lambda_l - \psi. \tag{7.36}$$


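Thus the log likelihood is

$$\sum t_l \lambda_l + \psi \sum d_l - \sum \{k(\lambda_l + \psi) + k(\lambda_l - \psi)\}, \tag{7.37}$$

where $t_l = y_{l1} + y_{l0}$, $d_l = y_{l1} - y_{l0}$.
The formal maximum likelihood equations are thus

$$t_l = k'(\hat\lambda_l + \hat\psi) + k'(\hat\lambda_l - \hat\psi), \quad
\sum d_l = \sum \{k'(\hat\lambda_l + \hat\psi) - k'(\hat\lambda_l - \hat\psi)\}. \tag{7.38}$$

If we argue locally near $\psi = 0$ we have that approximately

$$\hat\psi = \frac{\sum d_l}{2 \sum k''(\hat\lambda_l)}, \tag{7.39}$$

the limit in probability of which is

$$\psi \, \frac{\lim n^{-1} \sum k''(\lambda_l)}{\lim n^{-1} \sum k''(\hat\lambda_l)}. \tag{7.40}$$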

In general the factor multiplying $\psi$ is not 1. If, however, $k''(\lambda)$
is constant or linear, i.e., $k(\cdot)$ is quadratic or cubic, the ratio is 1.
The former is the special case of a normal distribution with known variance and
unknown mean and in this case the difference in means is the limit in
probability of the maximum likelihood estimate, as is easily verified directly.

Another special case is the Poisson distribution of mean $\mu$ and canonical
parameter $\phi = \log \mu$, so that the additive model in $\phi$ is a
multiplicative representation for the mean. Here $k''(\phi) = k(\phi) = e^\phi =
\mu$. It follows again that the ratio in question tends to 1 in probability. It
is likely that in all other cases the maximum likelihood estimate of $\psi$ is
inconsistent.
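
The contrast between the Poisson case and a case that is not covered can be
illustrated numerically. The sketch below is an illustration under stated
assumptions rather than part of the text: the helper names, the uniform
distributions for the incidental parameters and the closed-form solutions of
(7.38) noted in the comments were worked out for this sketch. For Poisson pairs
the joint maximum likelihood estimate should settle near $\psi$; for binary
pairs with logistic canonical parameter it should settle near $2\psi$.

```python
import math
import random


def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def poisson_mle(m, psi, rng):
    """Joint MLE of psi for Poisson pairs, solving (7.38) in closed form."""
    sum_d = sum_t = 0
    for _ in range(m):
        lam = rng.uniform(0.0, 1.5)  # incidental parameter (assumed range)
        y1 = sample_poisson(rng, math.exp(lam + psi))
        y0 = sample_poisson(rng, math.exp(lam - psi))
        sum_d += y1 - y0
        sum_t += y1 + y0
    # With k(phi) = exp(phi), (7.38) gives sum d_l = tanh(psi_hat) * sum t_l.
    return math.atanh(sum_d / sum_t)


def binary_mle(m, psi, rng):
    """Joint MLE of psi for binary pairs, k(phi) = log(1 + exp(phi))."""
    n10 = n01 = 0
    for _ in range(m):
        lam = rng.uniform(-1.0, 1.0)  # incidental parameter (assumed range)
        y1 = rng.random() < 1.0 / (1.0 + math.exp(-(lam + psi)))
        y0 = rng.random() < 1.0 / (1.0 + math.exp(-(lam - psi)))
        if y1 and not y0:
            n10 += 1
        elif y0 and not y1:
            n01 += 1
    # Concordant pairs drop out; discordant pairs have lambda_hat = 0, so
    # (7.38) reduces to tanh(psi_hat / 2) = (n10 - n01) / (n10 + n01).
    return math.log(n10 / n01)


rng = random.Random(1)
psi_true = 0.5
psi_poisson = poisson_mle(20000, psi_true, rng)
psi_binary = binary_mle(20000, psi_true, rng)
print(f"true psi = {psi_true}, Poisson MLE = {psi_poisson:.3f}, "
      f"binary MLE = {psi_binary:.3f}")
```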


7.6.3 Requirements for a modified likelihood 


The final route to resolving these difficulties with direct use of maximum 
likelihood, which we now develop, involves modifying the likelihood function. 
Suppose that the maximum likelihood estimating equation 


$$[\nabla l(\theta; y)]_{\hat\theta} = \nabla l(\hat\theta; y) = 0 \tag{7.41}$$

is replaced for some suitable function $q$ by

$$\nabla q(\tilde\theta; y) = 0. \tag{7.42}$$


In order for the main properties of maximum likelihood estimates to apply 
preferably in some extended circumstances, we require that: 


• the estimating equation is unbiased in the sense that

  $$E\{\nabla q(\theta; Y); \theta\} = 0; \tag{7.43}$$

• the analogue of Fisher’s identity that relates the variance of the score to
  the information holds, namely that

  $$E\{\nabla\nabla^T q(\theta; Y) + \nabla q(\theta; Y)\nabla^T q(\theta; Y)\} = 0; \tag{7.44}$$

• an asymptotic dependence on $n$ obtains similar to that for maximum
  likelihood estimates. Then the asymptotic covariance matrix of $\tilde\theta$
  is given by the inverse of a matrix analogous to an information matrix;

• preferably in some sense all or most of the available information is
  recovered.


We call the first two properties just given first- and second-order validity 
respectively. 

First-order validity is essential for the following discussion to apply. If first- 
but not second-order validity holds the asymptotic covariance matrix is given 
by the slightly more complicated sandwich formula of Section 7.4. 

We now give a number of examples of modified likelihoods to which the 
above discussion applies. 


7.6.4 Examples of modified likelihood 


We begin by returning to Example 7.9, the general linear normal-theory model. 
There are two approaches to what amounts to the elimination from the problem 
of the nuisance parameters, in this case the regression parameter $\beta$.

We can find an orthogonal matrix, $b$, such that the first $d$ rows are in the
space spanned by the columns of $z$. Then the transformed variables $V = bY$ are
such that $E(V_{d+1}) = 0, \ldots, E(V_n) = 0$. Moreover, because of the
orthogonality of the transformation, the $V_s$ are independently normally
distributed with constant variance $\tau$. The least squares estimates
$\hat\beta$ are a nonsingular transformation of $v_1, \ldots, v_d$ and the
residual sum of squares is $\mathrm{RSS} = v_{d+1}^2 + \cdots + v_n^2$.
Therefore the likelihood is the product of a factor from $V_1, \ldots, V_d$, or
equivalently from the least squares estimates $\hat\beta$, and a factor from the
remaining variables $V_{d+1}, \ldots, V_n$. That is, the likelihood is

$$L_1(\beta, \tau; \hat\beta) \, L_2(\tau; v_{d+1}, \ldots, v_n). \tag{7.45}$$


Suppose we discard the factor $L_1$ and treat $L_2$ as the likelihood for
inference about $\tau$. There is now a single unknown parameter and, the sample
size being $n - d$, the maximum likelihood estimate of $\tau$ is

$$\mathrm{RSS}/(n - d), \tag{7.46}$$

the standard estimate for which the usual properties of maximum likelihood
estimates apply provided $(n - d) \to \infty$.

It is hard to make totally precise the condition that $L_1$ contains little
information about $\tau$. Note that if $\beta$ is a totally free parameter, in
some sense, then an exact fit to any values of $v_1, \ldots, v_d$ can always be
achieved in which case it is compelling that there is no base left for
estimation of $\tau$ from this part of the original data. Any constraint, even a
somewhat qualitative one, on $\beta$ would, however, change the situation. Thus
if it were thought that $\beta$ is close to a fixed point, for example the
origin, or more generally is near some space of lower dimension than $d$, and
the distance of $\hat\beta$ from the origin or subspace is smaller than that to
be expected on the basis of $\mathrm{RSS}/(n - d)$, then there is some evidence
that this latter estimate is too large.

We say that the likelihood $L_2(\tau)$ is directly realizable in that it is the
ordinary likelihood, calculated under the original data-generating
specification, of a system of observations that could have arisen. It is
conceptually possible that the original data could have been collected,
transformed and only $v_{d+1}, \ldots, v_n$ retained. In this and other similar
cases, directly realizable likelihoods may be very different from the full
likelihood of the whole data.

Such likelihoods are often obtained more generally by a preliminary trans- 
formation of Y to new variables V, W and then by writing the original likelihood 
in the form 


$$f_V(v; \theta) \, f_{W|V}(w; v, \theta) \tag{7.47}$$

and arranging that one of the factors depends only on the parameter of interest
$\psi$ and that there are arguments for supposing that the information about
$\psi$ in the other factor is unimportant. Provided the transformation from $Y$
to $(V, W)$ does not depend on $\theta$, the Jacobian involved in the
transformation of continuous variables plays no role and will be ignored.

The most satisfactory version of this is when the factorization is complete in 
the sense that either 


$$f_V(v; \lambda) \, f_{W|V}(w; v, \psi) \tag{7.48}$$

or

$$f_V(v; \psi) \, f_{W|V}(w; v, \lambda). \tag{7.49}$$

In the former case the information about $\psi$ is captured by a conditional
likelihood of $W$ given $V$ and in the latter case by the marginal likelihood
of $V$.

There are other possible bases for factorizing the likelihood into components. 
When a factor can be obtained that totally captures the dependence on $\psi$ there is
no need for the component of likelihood to be directly realizable. The required 
first- and second-order properties of score and information follow directly from 
those of the whole log likelihood, as in the following example. 


Example 7.11. Parametric model for right-censored failure data. For simpli- 
city we consider a problem without explanatory variables. Let failure time have 
density $f_T(t; \psi)$ and survival function $S_T(t; \psi)$ and censoring time
have density and survival function $f_C(c; \lambda)$ and $S_C(c; \lambda)$. We
represent uninformative censoring by supposing that we have $n$ independent
pairs of independent random variables $(T_k, C_k)$ and that we observe only
$Y_k = \min(T_k, C_k)$ and the indicator variable $D_k = 1$ for a failure, i.e.,
$T_k < C_k$, and zero for censoring. Then the likelihood is

$$\prod f_C^{1-d_k}(y_k; \lambda) \, S_C^{d_k}(y_k; \lambda) \times
\prod f_T^{d_k}(y_k; \psi) \, S_T^{1-d_k}(y_k; \psi). \tag{7.50}$$

Here we use the second factor for inference about $\psi$, even though it does not
on its own correspond to a system of observations within the initial stochastic 
specification. Note that the specification assumes that there are no parameters 
common to the distributions of T and C. Were there such common parameters 
the analysis of the second factor would retain key properties, but in general lose 
some efficiency. 
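
As a concrete instance, suppose, purely for illustration, that both failure and
censoring times are exponential, with rates $\psi$ and $\lambda$. This is an
assumption of the sketch, not of the text. The second factor of (7.50) is then
$\psi^{\sum d_k} \exp(-\psi \sum y_k)$, which is maximized at
$\sum d_k / \sum y_k$ whatever the value of $\lambda$. The function name and
numerical settings below are hypothetical.

```python
import random


def censored_exponential_rate(n, psi, lam, seed=0):
    """Maximize the failure-time factor of (7.50) for exponential lifetimes."""
    rng = random.Random(seed)
    failures, total_time = 0, 0.0
    for _ in range(n):
        t = rng.expovariate(psi)  # failure time, rate psi
        c = rng.expovariate(lam)  # censoring time, rate lam (nuisance)
        failures += t < c         # d_k = 1 when the failure is observed
        total_time += min(t, c)   # y_k = min(t_k, c_k)
    # Maximizer of psi^(sum d_k) * exp(-psi * sum y_k).
    return failures / total_time


psi_hat = censored_exponential_rate(n=20000, psi=2.0, lam=1.0)
print(f"estimated failure rate = {psi_hat:.3f}")
```

The censoring rate enters only through the simulated data; the estimate itself
never requires a model for the censoring distribution.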

A factorization of the likelihood in which the values of the conditional
information measure, $i_{\psi\psi\cdot\lambda}$, were respectively $O(1)$, i.e.,
bounded, and $O(n)$ would asymptotically have the same properties as complete
separation.


7.6.5 Partial likelihood 


We now discuss a particular form of factorization that is best set out in terms of 
a system developing in time although the temporal interpretation is not crucial. 
Suppose that, possibly after transformation, the sequence $Y_1, \ldots, Y_n$ of
observed random variables can be written $T_1, S_1, \ldots, T_m, S_m$ in such a
way that the density of $S_k$ given $C_k$, all the previous random variables in
the sequence, is specified as a function of $\psi$. Here $m$ is related to but
in general different from $n$. It is helpful to think of $S_k$ as determining an
event of interest and the $T_k$ as determining associated relevant events that
are, however, of no intrinsic interest for the current interpretation.
We then call 


$$\prod f_{S_k|C_k}(s_k \mid C_k = c_k; \psi) \tag{7.51}$$

a partial likelihood based on the $S_k$. Let $U_k$ and $i_k(\psi)$ denote the
contribution to the component score vector and information matrix arising from
the $k$th term, i.e., from $S_k$. Then, because the density from which these are
calculated is normalized to integrate to 1, we have that

$$E(U_k) = 0, \quad \mathrm{var}(U_k) = i_k(\psi). \tag{7.52}$$

Now the total partial likelihood is in general not normalized to 1 so that
the same argument cannot be applied for the sum of the partial scores and
information. We can, however, argue as follows. The properties of $U_k$ hold
conditionally on $C_k$, the total history up to the defining variable $S_k$, and
hence in particular

$$E(U_k \mid U_1 = u_1, \ldots, U_{k-1} = u_{k-1}) = 0 \tag{7.53}$$

and, in particular for $l < k$,

$$E(U_l^T U_k) = E\{U_l^T E(U_k \mid U_l)\} = 0, \tag{7.54}$$


so that the components of the total score, while in general not independent, are 
uncorrelated. Therefore the total information matrix is the covariance matrix of 
the total score and, provided standard n asymptotics holds, the usual properties 
of maximum likelihood estimates and their associated statistics are achieved. 
In the more formal theory a martingale Central Limit Theorem will give the 
required asymptotic normality but, of course, the issue of the adequacy of the 
distributional approximation has always to be considered. 


Example 7.12. A fairly general stochastic process. Consider a stochastic pro- 
cess with discrete states treated for simplicity in continuous time and let zero be 
a recurrent point. That is, on entry to state zero the system moves independently 
of its past and the joint density of the time to the next transition and the new
state, $m$, occupied is $p_m(t; \psi)$ for $m = 1, 2, \ldots$. Recurrence means
that it returns to state zero eventually having described a trajectory with
properties of unspecified complexity. Then the product of the $p_m(t; \psi)$
over the exits from state zero, including, if necessary, a contribution from a
censored exit time, obeys the conditions for a partial likelihood. If zero is a
regeneration point, then each term in the partial likelihood is independent of
the past and the partial likelihood is directly realizable, i.e., one could
envisage an observational system in which only the transitions out of zero were
observed. In general, however, each term $p_m(t; \psi)$ could depend in any
specified way on the whole past of the process.


Example 7.13. Semiparametric model for censored failure data. We suppose 
that for each of n independent individuals observation of failure time is subject 
to uninformative censoring of the kind described in Example 7.11. We define, 
as before, the hazard function corresponding to a density of failure time $f(t)$
and survival function $S(t)$ to be $h(t) = f(t)/S(t)$. Suppose further that
individual $l$ has a hazard function for failure of

$$h_0(t) \exp(z_l^T \psi), \tag{7.55}$$

where $z_l$ is a vector of possibly time-dependent explanatory variables for
individual $l$, $\psi$ is a vector of parameters of interest specifying the
dependence of failure time on $z$ and $h_0(t)$ is an unknown baseline hazard
corresponding to an individual with $z = 0$.




Number the failures in order and let $C_k$ specify all the information available
about failures and censoring up to and including the time of the $k$th failure,
including any changes in the explanatory variables, but excluding the identity
of the individual who fails at this point. Let $M_k$ specify the identity of
that individual (simultaneous occurrences being assumed absent). Now $C_k$
determines in particular $R_k$, the risk set of individuals who are available to
fail at the time point in question. Then the probability, given a failure at the
time in question, that the failure is for individual $M_k$ is

$$\exp(z_{m_k}^T \psi) \, h_0(t) \Big/ \sum_{l \in R_k} \exp(z_l^T \psi) \, h_0(t). \tag{7.56}$$

Thus the baseline hazard cancels and the product of these expressions over $k$
yields a relatively simple partial likelihood for $\psi$.
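
A minimal sketch of how the partial likelihood built from (7.56) might be
maximized for a single fixed scalar covariate follows. Everything specific in
it is an assumption made for illustration: the function names, the Newton
iteration, the constant baseline hazard used to simulate data and the
exponential censoring. The baseline hazard is used only to generate data and
never enters the estimation.

```python
import math
import random


def partial_score_and_information(psi, times, failed, z):
    """Score and information of the log partial likelihood, scalar covariate."""
    # Visit individuals from latest to earliest time so that the running sums
    # always describe the current risk set (ties are assumed absent).
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    s0 = s1 = s2 = 0.0
    score = info = 0.0
    for i in order:
        w = math.exp(z[i] * psi)
        s0 += w
        s1 += z[i] * w
        s2 += z[i] * z[i] * w
        if failed[i]:
            mean_z = s1 / s0                # risk-set weighted mean of z
            score += z[i] - mean_z
            info += s2 / s0 - mean_z ** 2   # risk-set weighted variance of z
    return score, info


def maximize_partial_likelihood(times, failed, z, iterations=25):
    """Newton iteration from psi = 0; returns the estimate and 1/information."""
    psi = 0.0
    for _ in range(iterations):
        score, info = partial_score_and_information(psi, times, failed, z)
        step = score / info
        psi += step
        if abs(step) < 1e-10:
            break
    return psi, 1.0 / info


rng = random.Random(2)
psi_true = 0.7
times, failed, z = [], [], []
for _ in range(4000):
    zi = float(rng.random() < 0.5)                # binary covariate
    t = rng.expovariate(math.exp(psi_true * zi))  # hazard exp(psi * z), h0 = 1
    c = rng.expovariate(0.5)                      # uninformative censoring
    times.append(min(t, c))
    failed.append(t < c)
    z.append(zi)

psi_hat, var_hat = maximize_partial_likelihood(times, failed, z)
print(f"psi_hat = {psi_hat:.3f}, estimated standard error = {var_hat ** 0.5:.3f}")
```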


7.6.6 Pseudo-likelihood 


We now consider a method for inference in dependent systems in which the 
dependence between observations is either only incompletely specified or spe- 
cified only implicitly rather than explicitly, making specification of a complete 
likelihood function either very difficult or impossible. Many of the most import- 
ant applications are to spatial processes but here we give more simple examples, 
where the dependence is specified one-dimensionally as in time series. 

In any such fully specified model we can write the likelihood in the form 


$$f_{Y_1}(y_1; \theta) \, f_{Y_2|Y_1}(y_2 \mid y_1; \theta) \cdots
f_{Y_n|Y^{(n-1)}}(y_n \mid y^{(n-1)}; \theta), \tag{7.57}$$

where in general $y^{(k)} = (y_1, \ldots, y_k)$. Now in a Markov process the
dependence on $y^{(k)}$ is restricted to dependence on $y_k$ and more generally
for an $m$-dependent Markov process the dependence is restricted to the last $m$
components.

Suppose, however, that we consider the function

$$\prod f_{Y_k|Y_{k-1}}(y_k \mid y_{k-1}; \theta), \tag{7.58}$$


where the one-step conditional densities are correctly specified in terms of an 
interpretable parameter $\theta$ but where higher-order dependencies are not assumed
absent but are ignored. Then we call such a function, that takes no account of 
certain dependencies, a pseudo-likelihood. 

If $U_k(\theta)$ denotes the gradient or pseudo-score vector obtained from
$Y_k$, because each term in the pseudo-likelihood is a probability density
normalized to integrate to 1, then

$$E\{U_k(\theta); \theta\} = 0, \tag{7.59}$$

and the covariance matrix of a single vector $U_k$ can be found from the
appropriate information matrix. In general, however, the $U_k$ are not
independent and so the covariance matrix of the total pseudo-score is not
$i(\theta)$, the sum of the separate information matrices. Provided that the
dependence is sufficiently weak for standard $n$ asymptotics to apply, the
pseudo-maximum likelihood estimate $\tilde\theta$ found from (7.58) is
asymptotically normal with mean $\theta$ and covariance matrix

$$i^{-1}(\theta) \, \mathrm{cov}(U) \, i^{-1}(\theta). \tag{7.60}$$

Here $U = \sum U_k$ is the total pseudo-score and its covariance has to be
found, for example by applying somewhat empirical time series methods to the
sequence $\{U_k\}$.


Example 7.14. Lag one correlation of a stationary Gaussian time series. Suppose
that the observed random variables are assumed to form a stationary Gaussian
time series of unknown mean $\mu$ and variance $\sigma^2$ and lag one
correlation $\rho$ and otherwise to have unspecified correlation structure. Then
$Y_k$ given $Y_{k-1} = y_{k-1}$ has a normal distribution with mean
$\mu + \rho(y_{k-1} - \mu)$ and variance $\sigma^2(1 - \rho^2)$; this last
variance, the innovation variance in a first-order autoregressive process, may
be taken as a new parameter. Note that if a term from $Y_1$ is included it will
have the marginal distribution of the process, assuming observation starts in a
state of statistical equilibrium. For inference about $\rho$ the most relevant
quantity is the adjusted score $U_{\rho\cdot\mu,\sigma}$ of (6.62) and its
variance. Some calculation shows that the relevant quantity is the variance of

$$\sum \{(Y_k - \mu)(Y_{k-1} - \mu) - \rho(Y_k - \mu)^2\}. \tag{7.61}$$

Here $\mu$ can be replaced by an estimate, for example the overall mean, and the
variance can be obtained by treating the individual terms in the last expression
as a stationary time series, and finding its empirical autocorrelation function.
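
One way the sandwich formula (7.60) might be applied to this example is
sketched below. The choices made are assumptions of the sketch and not
prescriptions of the text: each pseudo-score term is taken as the residual from
the conditional mean $\mu + \rho(y_{k-1} - \mu)$ multiplied by $(y_{k-1} - \mu)$,
the mean is replaced by the overall mean, and the covariance of the total
pseudo-score is estimated from empirical autocovariances with linearly
decreasing weights up to a hypothetical cut-off `max_lag`. The data are
simulated from a first-order autoregression, for which the sandwich variance
should be close to $(1 - \rho^2)/n$.

```python
import random


def lag_one_estimate(y, max_lag=10):
    """Pseudo-likelihood estimate of rho and a sandwich variance estimate."""
    n = len(y)
    ybar = sum(y) / n
    x = [v - ybar for v in y]
    prev, curr = x[:-1], x[1:]
    # Minus the derivative of the total pseudo-score with respect to rho
    # (the innovation variance cancels in the sandwich and is omitted).
    info = sum(p * p for p in prev)
    rho = sum(c * p for c, p in zip(curr, prev)) / info
    # Individual pseudo-score terms evaluated at the estimate.
    u = [(c - rho * p) * p for c, p in zip(curr, prev)]
    # Empirical variance of the sum of the u_k, allowing for autocorrelation.
    cov_u = sum(v * v for v in u)
    for h in range(1, max_lag + 1):
        weight = 1.0 - h / (max_lag + 1.0)
        cov_u += 2.0 * weight * sum(u[k] * u[k + h] for k in range(len(u) - h))
    return rho, cov_u / info ** 2


rng = random.Random(3)
rho_true = 0.6
y = [rng.gauss(0.0, 1.0 / (1.0 - rho_true ** 2) ** 0.5)]  # equilibrium start
for _ in range(4999):
    y.append(rho_true * y[-1] + rng.gauss(0.0, 1.0))

rho_hat, var_hat = lag_one_estimate(y)
print(f"rho_hat = {rho_hat:.3f}, sandwich s.e. = {var_hat ** 0.5:.4f}, "
      f"AR(1) theory s.e. = {((1 - rho_true ** 2) / len(y)) ** 0.5:.4f}")
```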


Example 7.15. A long binary sequence. Suppose that $Y_1, \ldots, Y_m$ is a
sequence of binary random variables all of which are mutually dependent. Suppose
that the dependencies are represented by a latent multivariate normal
distribution. More precisely, there is an unobserved random vector
$W_1, \ldots, W_m$ having a multivariate normal distribution of zero means and
unit variances and correlation matrix $P$ and there are threshold levels
$\alpha_1, \ldots, \alpha_m$ such that $Y_k = 1$ if and only if
$W_k > \alpha_k$. The unknown parameters of interest are typically those
determining $P$ and possibly also the $\alpha_k$. The data may consist of one
long realization or of several independent replicate sequences.

The probability of any sequence of Os and 1s, and hence the likelihood, can 
then be expressed in terms of the m-dimensional multivariate normal distribu- 
tion function. Except for very small values of m there are two difficulties with 
this likelihood. One is computational in that high-dimensional integrals have to 
be evaluated. The other is that one might well regard the latent normal model as 
a plausible representation of low-order dependencies but be unwilling to base 
high-order dependencies on it. 

This suggests consideration of the pseudo-likelihood 


$$\prod_{k > l} P(Y_k = y_k, Y_l = y_l), \tag{7.62}$$


which can be expressed in terms of the bivariate normal integral; algorithms for 
computing this are available. 

The simplest special case is fully symmetric, having $\alpha_k = \alpha$ and all
correlations equal, say to $\rho$. In the further special case $\alpha = 0$ of
median dichotomy the probabilities can be found explicitly because, for example,
by Sheppard’s formula,

$$P(Y_k = Y_l = 0) = \frac{1}{4} + \frac{\sin^{-1}\rho}{2\pi}. \tag{7.63}$$

If we have $n$ independent replicate sequences then first-, but not second-,
order validity applies, as does standard $n$ asymptotics, and estimates of
$\alpha$ and $\rho$ are obtained, the latter not fully efficient because of
neglect of information in the higher-order relations.
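
Sheppard's formula (7.63) is easily checked by simulation. The sketch below is
illustrative only: the function name and settings are hypothetical, and pairs
with correlation $\rho$ are generated by adding a shared standard normal
component to independent ones, a construction assumed here for convenience.

```python
import math
import random


def both_below_median(rho, draws, seed=0):
    """Monte Carlo estimate of P(W_k <= 0, W_l <= 0) for correlation rho."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        v = rng.gauss(0.0, 1.0)  # component shared by the pair
        w1 = v * math.sqrt(rho) + rng.gauss(0.0, 1.0) * math.sqrt(1.0 - rho)
        w2 = v * math.sqrt(rho) + rng.gauss(0.0, 1.0) * math.sqrt(1.0 - rho)
        hits += (w1 <= 0.0) and (w2 <= 0.0)
    return hits / draws


rho = 0.5
simulated = both_below_median(rho, draws=200000)
exact = 0.25 + math.asin(rho) / (2.0 * math.pi)
print(f"simulated = {simulated:.4f}, Sheppard = {exact:.4f}")
```

For $\rho = 1/2$ the formula gives exactly $1/3$, since
$\sin^{-1}(1/2) = \pi/6$.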

Suppose, however, that m is large, and that there is just one very long 
sequence. It is important that in this case direct use of the pseudo-likelihood 
above is not satisfactory. This can be seen as follows. Under this special
correlational structure we may write, at least for $\rho > 0$, the latent, i.e.,
unobserved, normal random variables, in the form

$$W_k = V\sqrt{\rho} + V_k\sqrt{1 - \rho}, \tag{7.64}$$

where $V, V_1, \ldots, V_m$ are independent standard normal random variables. It
follows that the model is equivalent to the use of thresholds at
$\alpha - V\sqrt{\rho}$ with independent errors, instead of ones at $\alpha$. It
follows that if we are concerned with the behaviour for large $m$, the estimates
of $\alpha$ and $\rho$ will be close respectively to $\alpha - V\sqrt{\rho}$ and
to zero. The general moral is that application of this kind of pseudo-likelihood
to a single long sequence is unsafe if the long-range dependencies are not
sufficiently weak.


Example 7.16. Case-control study. An important illustration of these ideas 
is provided by a type of investigation usually called a case-control study; in 
econometrics the term choice-based sampling is used. To study this it is helpful 
to consider as a preliminary a real or notional population of individuals, each 
having in principle a binary outcome variable y and two vectors of explanatory 
variables w and z. Individuals with y = 1 will be called cases and those with 
y = 0 controls. The variables w are to be regarded as describing the intrinsic 
nature of the individuals, whereas z are treatments or risk factors whose effect on 
y is to be studied. For example, w might include the gender of a study individual, 
z one or more variables giving the exposure of that individual to environmental 
hazards and y might specify whether or not the individual dies from a specific 
cause. Both z and w are typically vectors, components of which may be discrete 
or continuous. 

Suppose that in the population we may think of Y,Z,W as random 
variables with 


$$P(Y = 1 \mid W = w, Z = z) = L(\alpha + \beta^T z + \gamma^T w), \tag{7.65}$$

where $L(x) = e^x/(1 + e^x)$ is the unit logistic function. Interest focuses on
$\beta$, assessing the effect of $z$ on $Y$ for fixed $w$. It would be possible to include
an interaction term between z and w without much change to the following 
discussion. 

Now if the response y = 1 is rare and also if it is a long time before the 
response can be observed, direct observation of the system just described can 
be time-consuming and in a sense inefficient, ending with a very large number 
of controls relative to the number of cases. Suppose, therefore, instead that data 
are collected as follows. Each individual with outcome y, where y = 0,1, is 
included in the data with conditional probability, given the corresponding z, w, 
of $p_y(w)$ and $z$ determined retrospectively. For a given individual, $D$
denotes its inclusion in the case-control sample. We write

$$P(D \mid Y = y, Z = z, W = w) = p_y(w). \tag{7.66}$$


It is crucial to the following discussion that the selection probabilities do 
not depend on z given w. In applications it would be quite common to take 
$p_1(w) = 1$, i.e., to take all possible cases. Then for each case one or more
controls are selected with probability of selection defined by $p_0(w)$. Choice
of $p_0(w)$ is discussed in more detail below.

It follows from the above specification that for the selected individuals, we 
have in a condensed notation that 


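$$\begin{aligned}
P(Y = 1 \mid D, z, w)
&= \frac{P(Y = 1 \mid z, w) \, P(D \mid 1, z, w)}{P(D \mid z, w)} \\
&= \frac{L(\alpha + \beta^T z + \gamma^T w) \, p_1(w)}
        {L(\alpha + \beta^T z + \gamma^T w) \, p_1(w)
         + \{1 - L(\alpha + \beta^T z + \gamma^T w)\} \, p_0(w)} \\
&= L\{\alpha + \beta^T z + \gamma^T w + \log\{p_1(w)/p_0(w)\}\}.
\end{aligned} \tag{7.67}$$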
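
The last equality in (7.67) can be confirmed numerically. In this small sketch
the function names and the particular selection probabilities are illustrative
assumptions; `eta` stands for the population linear predictor
$\alpha + \beta^T z + \gamma^T w$.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_case_given_selected(eta, p1, p0):
    """P(Y = 1 | D) from Bayes' theorem, as in the middle line of (7.67)."""
    num = logistic(eta) * p1
    return num / (num + (1.0 - logistic(eta)) * p0)


p1, p0 = 1.0, 0.02  # all cases, 2% of controls (assumed values)
max_gap = 0.0
for eta in (-4.0, -1.0, 0.5, 2.0):
    direct = prob_case_given_selected(eta, p1, p0)
    shifted = logistic(eta + math.log(p1 / p0))
    max_gap = max(max_gap, abs(direct - shifted))
    print(f"eta = {eta:+.1f}  Bayes = {direct:.6f}  shifted logistic = {shifted:.6f}")
```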


There are now two main ways of specifying the choice of controls. In the
first, for each case one or more controls are chosen that closely match the case
with respect to $w$. We then write for the $k$th such group of a case and
controls, essentially without loss of generality,

$$\gamma^T w + \log\{p_1(w)/p_0(w)\} = \lambda_k, \tag{7.68}$$

where $\lambda_k$ characterizes this group of individuals. This representation
points towards the elimination of the nuisance parameters $\lambda_k$ by
conditioning. An alternative is to assume for the entire data that approximately

$$\log\{p_1(w)/p_0(w)\} = \eta + \zeta^T w, \tag{7.69}$$

pointing to the unconditional logistic relation of the form

$$P(Y = 1 \mid D, z, w) = L(\alpha^* + \beta^T z + \gamma^{*T} w). \tag{7.70}$$

The crucial point is that $\beta$, but not the other parameters, takes the same
value as the originating value in the cohort study.
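
The invariance of $\beta$ under case-control sampling can be seen in a small
simulation. The sketch makes several simplifying assumptions that are not in
the text: there is a single binary risk factor $z$ and no $w$, all cases are
retained, and controls are retained with a constant probability `p0`. With a
binary $z$ the fitted logistic slope is the log odds ratio of the resulting
2 x 2 table and the fitted intercept is the log ratio of cases to controls at
$z = 0$.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def case_control_table(population, alpha, beta, p0, seed=0):
    """Counts of sampled individuals indexed by (y, z)."""
    rng = random.Random(seed)
    counts = {(y, z): 0 for y in (0, 1) for z in (0, 1)}
    for _ in range(population):
        z = int(rng.random() < 0.3)  # exposure prevalence (assumed)
        y = int(rng.random() < logistic(alpha + beta * z))
        # Keep every case; keep each control with probability p0.
        if y == 1 or rng.random() < p0:
            counts[(y, z)] += 1
    return counts


alpha, beta, p0 = -4.0, 1.0, 0.05
counts = case_control_table(200000, alpha, beta, p0)
log_odds_ratio = math.log(
    counts[(1, 1)] * counts[(0, 0)] / (counts[(1, 0)] * counts[(0, 1)])
)
intercept = math.log(counts[(1, 0)] / counts[(0, 0)])
print(f"slope: {log_odds_ratio:.3f} (population beta = {beta})")
print(f"intercept: {intercept:.3f} "
      f"(alpha + log(p1/p0) = {alpha + math.log(1.0 / p0):.3f})")
```

The slope should be close to the population value of $\beta$, whereas the
intercept is shifted by $\log\{p_1/p_0\}$.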

Now rather than base the analysis directly on one of the last two formulae, 
it is necessary to represent that in this method of investigation the variable y is 
fixed for each individual and the observed random variable is Z. The likelihood 
is therefore the product over the observed individuals of, again in a condensed 
notation, 


$$f_{Z|D,Y,W} = f_{Y|D,Z,W} \, f_{Z|D,W} / f_{Y|D,W}. \tag{7.71}$$


The full likelihood function is thus a product of three factors. By (7.70) the 
first factor has a logistic form. The final factor depends only on the known 
functions $p_y(w)$ and on the numbers of cases and controls and can be ignored.
The middle factor is a product of terms of the form

$$\frac{f_Z(z)\{L(\alpha + \beta^T z + \gamma^T w) \, p_1(w)
        + (1 - L(\alpha + \beta^T z + \gamma^T w)) \, p_0(w)\}}
       {\int dv \, f_Z(v)\{L(\alpha + \beta^T v + \gamma^T w) \, p_1(w)
        + (1 - L(\alpha + \beta^T v + \gamma^T w)) \, p_0(w)\}}. \tag{7.72}$$


Thus the parameter of interest, $\beta$, occurs both in the relatively easily
handled logistic form of the first factor and in the second more complicated
factor. One justification for ignoring the second factor could be to regard the
order of the observations as randomized, without loss of generality, and then to
regard the first logistic factor as a pseudo-likelihood. There is the
difficulty, not affecting the first-order pseudo-likelihood, that the different
terms are not independent in the implied new random system, because the total
numbers of cases and controls are constrained, and may even be equal. Suppose,
however, that the marginal distribution of $Z$, i.e., $f_Z(z)$, depends on
unknown parameters $\omega$ in such a way that when
$(\omega, \alpha, \beta, \gamma)$ are all unknown the second factor on its own
provides no information about $\beta$. This would be the case if, for example,
$\omega$ represented an arbitrary multinomial distribution over a discrete set
of possible values of $z$. The profile likelihood for $(\alpha, \beta, \gamma)$,
having maximized out over $\omega$, is then essentially the logistic first
factor which is thus the appropriate likelihood function for inference about
$\beta$.

Another extreme case arises if $f_Z(z)$ is completely known, when there is
in principle further information about $\beta$ in the full likelihood. It is
unknown whether such information is appreciable; it seems unlikely that it would
be wise to use it.


7.6.7 Quasi-likelihood 


The previous section concentrated largely on approximate versions of likeli- 
hood applicable when the dependency structure of the observations is relatively 
complicated. Quasi-likelihood deals with problems in which, while a relation 
between the mean and variance is reasonably secure, distributional assumptions 
are suspect. 

The simplest instance is provided by Example 1.3, the linear model in which 
no normality assumption is made and an analysis is made on the basis of the 
second-order assumptions that the variance is constant and observations are 
mutually uncorrelated. The definition of the least squares estimates as obtained 
by orthogonal projection of Y on the space spanned by the columns of the 
defining matrix z and properties depending only on expectations of linear and 
quadratic functions of Y continue to hold. 

The sufficiency argument used above to justify the optimality of least squares 
estimates does depend on normality but holds in weaker form as follows. In 
the linear model $E(Y) = z\beta$, any linear estimate of a single component of
$\beta$ must have the form $l^T Y$, where $l$ is an estimating vector. Now $l$
can be resolved into a component $l_z$ in the column space of $z$ and an
orthogonal component $l_{\perp z}$. A direct calculation shows that the two
components of $l^T Y$ are uncorrelated, so that $\mathrm{var}(l^T Y)$ is the sum
of two contributions. Also the second component has zero mean for all $\beta$.
Hence for optimality we set $l_{\perp z} = 0$, the estimate has the form
$l_z^T Y$, and, so long as $z$ is of full rank, is uniquely the component of the
least squares estimate. The proof may be called an appeal to linear sufficiency.

If instead of the components of Y being uncorrelated with equal variance, 
they have covariance matrix $\gamma V$, where $\gamma$ is a scalar, possibly
unknown, and $V$ is a known positive definite matrix, then the least squares
estimates, defined by orthogonal projection in a new metric, satisfy

$$z^T V^{-1}(Y - z\hat\beta) = 0, \tag{7.73}$$

as can be seen, for example, by a preliminary linear transformation of $Y$ to
$V^{-1/2} Y$.

Now suppose that $E(Y) = \mu(\beta)$, where the vector of means is in general a
nonlinear function of $\beta$, as well as of explanatory variables. We assume
that the components of $Y$ are uncorrelated and have covariance matrix again
specified as $\gamma V(\beta)$, where $\gamma$ is a constant and $V(\beta)$ is a
known function of $\beta$. Now the analogue of the space spanned by the columns
of $z$ is the tangent space defined by

$$z^T(\beta) = \nabla \mu^T(\beta), \tag{7.74}$$

a $d_\beta \times n$ matrix; note that if $\mu(\beta) = z\beta$, then the new
matrix reduces to the previous $z^T$. The estimating equation that corresponds
to the least squares estimating equation is thus

$$z^T(\hat\beta) \, V^{-1}(\hat\beta) \{Y - \mu(\hat\beta)\} = 0. \tag{7.75}$$


This may be called the quasi-likelihood estimating equation. It generalizes the 
linear least squares and the nonlinear normal-theory least squares equations. 

The asymptotic properties of the resulting estimates can be studied via those
of the quasi-score function

$$z^T(\beta) \, V^{-1}(\beta) \{Y - \mu(\beta)\}, \tag{7.76}$$

which mirror those of the ordinary score function for a parametric model. That
score was defined as the gradient of the log likelihood and in many, but not all,
situations the quasi-score is itself the gradient of a function $Q(\beta, y)$ which may
be called the quasi-likelihood. Just as there are sometimes advantages to basing 
inference on the log likelihood rather than the score, so there may be advantages 
to using the quasi-likelihood rather than the quasi-score, most notably when the 
quasi-likelihood estimating equation has multiple roots. 

By examining a local linearization of this equation, it can be shown that 
the properties of least squares estimates holding in the linear model under 
second-order assumptions extend to the quasi-likelihood estimates. 

Relatively simple examples are provided by logistic regression for binary 
variables and log linear regression for Poisson variables. In the latter case, for 
example, the component observations are independent, the basic Poisson model 
has variances equal to the relevant mean, whereas the more general model has 
variances $\gamma$ times the mean, thus allowing for overdispersion, or indeed for the
less commonly encountered underdispersion. 
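
The overdispersed log linear case can be sketched as follows. The code solves
the quasi-likelihood estimating equation (7.75) for a model with an intercept
and one explanatory variable, taking $V(\beta)$ diagonal with entries $\mu_i$.
The assumptions of the sketch are these: the iteration is a scoring scheme
based on $z^T V^{-1} z$; the dispersion is estimated by a Pearson-type moment
estimate, which is a common choice but is not taken from the text; and
overdispersed counts are simulated as $\gamma$ times a Poisson variable of mean
$\mu/\gamma$, which has mean $\mu$ and variance $\gamma\mu$. All names are
hypothetical.

```python
import math
import random


def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for small means."""
    limit = math.exp(-mean)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def fit_quasi_poisson(x, y, iterations=50):
    """Solve (7.75) for log mu_i = b0 + b1 * x_i with V = diag(mu_i)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Quasi-score (7.76): with V = diag(mu) it reduces to sum x_i (y_i - mu_i).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Elements of z^T V^{-1} z, where z has rows d mu_i / d beta.
        a00 = sum(mu)
        a01 = sum(xi * mi for xi, mi in zip(x, mu))
        a11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = a00 * a11 - a01 * a01
        d0 = (a11 * u0 - a01 * u1) / det
        d1 = (a00 * u1 - a01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    mu = [math.exp(b0 + b1 * xi) for xi in x]
    # Pearson-type moment estimate of the dispersion (assumption of the sketch).
    dispersion = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu)) / (len(y) - 2)
    return b0, b1, dispersion


rng = random.Random(4)
gamma_true = 3.0
x = [rng.random() for _ in range(5000)]
y = [gamma_true * sample_poisson(rng, math.exp(1.0 + 0.5 * xi) / gamma_true)
     for xi in x]

b0_hat, b1_hat, gamma_hat = fit_quasi_poisson(x, y)
print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}, dispersion = {gamma_hat:.3f}")
```

The regression estimates coincide with those from an ordinary Poisson fit; only
the assessment of their precision changes, through the factor $\gamma$.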
Notes 7 


Section 7.1. A saddle-point of a surface is a point such that some paths through 
it have local maxima and other paths local minima, as at the top of a mountain 
pass. The Hessian matrix at such a point has eigenvalues of both signs. 


Section 7.2. A survey of nonregular problems is given by Smith (1989). Hall 
and Wang (2005) discuss Bayesian estimation of the end-point of a distribution. 

The special choice of frequencies $\omega = \omega_p$ in Example 7.4 ensures such
identities as

$$\sum \cos^2(k\omega_p) = \sum \sin^2(k\omega_p) = n/2,$$

as well as the exact orthogonality of components at different values of $\omega$.


Section 7.3. The rather exceptional sets in fact have Lebesgue measure zero. 
The discussion here is based on Rotnitzky et al. (2000), where references to 
earlier work are given. The main application is to methods for dealing with 
so-called informative nonresponse. In some simpler situations with singular 
information a transformation of the parameter is all that is required. 


Section 7.4. Cox (1961, 1962) develops a theory for estimates when the model 
is false in connection with the testing of separate families of hypotheses. This is 
concerned with testing a null hypothesis that the model lies in one family when 
the alternatives form a mathematically distinct although numerically similar 
family. The main applications have been econometric (White, 1994). See also 
Note 6.5. 


Section 7.5. The argument for dealing with situations such as the normal mix- 
ture is due to Davies (1977, 1987). For a general discussion, see Ritz and 
Skovgaard (2005). In specific applications simulation is probably the best way 
to establish p-values. 


Section 7.6. There is a very large literature on modified forms of likelihood, 
stemming in effect from Bartlett (1937) and then from Kalbfleisch and Sprott 
(1970). Partial likelihood was introduced by Cox (1972, 1975). The key property 
(7.54) links with the probabilistic notion of a martingale which underlies general 
discussions of maximum likelihood for dependent random variables (Silvey, 
1961). Much work on survival analysis emphasizes this link (Andersen et al., 
1992). 

Pseudo-likelihood was introduced by Besag (1974, 1977) and studied in a 
form related to the issues outlined here by Azzalini (1983). Example 7.5 is based 
on Cox and Reid (2004). Quasi-likelihood is due to Wedderburn (1974) and its 
optimal properties are given by McCullagh (1983). For a general treatment of 
modified likelihoods, see Lindsay (1988) and Varin and Vidoni (2005). Song 
et al. (2005) discussed breaking the log likelihood into two parts, one with 
relatively simple form and standard behaviour and the other encapsulating the 
complicated aspects. Formal application of these ideas in a Bayesian setting 
would allow specifications to be studied that were less complete than those 
needed for a full Bayesian discussion. 

Other forms of likelihood centre more strongly on non- or semiparametric 
situations and include empirical likelihood (Owen, 2001) in which essentially 
multinomial distributions are considered with support the data values. There is 
a link with the nonparametric bootstrap. For a synthesis of some of these ideas, 
see Mykland (1995), who develops a notion of dual likelihood. 

The fitting of smooth relations that are not specified parametrically is often 
accomplished by maximizing a formal likelihood function from which a cor- 
rection is subtracted depending on the irregularity of the fitted relation. The 
correction might depend, for example, on an average squared value of a high- 
order derivative. While some arbitrariness may be attached both to the form 
of the penalty as well as to the weight attached to it, such penalized functions 
bring some unity to what would otherwise be a rather ad hoc issue. See Green 
and Silverman (1994). 

The treatment of case-control studies largely follows Prentice and Pyke 
(1979) and Farewell (1979). For a comprehensive treatment of case-control 
studies, see Breslow and Day (1980). For a Bayesian analysis, see Seaman and 
Richardson (2004). 


Additional objectives 


Summary. This chapter deals in outline with a number of topics that fall 
outside the main theme of the book. The topics are prediction, decision ana- 
lysis and point estimation, concentrating especially on estimates that are exactly 
or approximately unbiased. Finally some isolated remarks are made about 
methods, especially for relatively complicated models, that avoid direct use 
of the likelihood. 


8.1 Prediction 


In prediction problems the target of study is not a parameter but the value 
of an unobserved random variable. This includes, however, in so-called hier- 
archical models estimating the value of a random parameter attached to a 
particular portion of the data. In Bayesian theory the formal distinction between 
prediction and estimation largely disappears in that all unknowns have prob- 
ability distributions. In frequentist theory the simplest approach is to use 
Bayes’ theorem to find the distribution of the aspect of interest and to replace 
unknown parameters by good estimates. In special cases more refined treatment 
is possible. 

In the special case when the value Y*, say, to be predicted is conditionally 
independent of the data given the parameters the Bayesian solution is par- 
ticularly simple. A predictive distribution is found by averaging the density 
$f_{Y^*}(y^*; \theta)$ over the posterior distribution of the parameter. 

In special cases a formally exact frequentist predictive distribution is obtained 
by the following device. Suppose that the value to be predicted has the distribution 
$f_{Y^*}(y^*; \theta^*)$, whereas the data have the density $f_Y(y; \theta)$. Construct a sensible 
test of the null hypothesis $\theta = \theta^*$ and take all those values of $y^*$ consistent 
with the null hypothesis at level $c$ as the prediction interval or region. 


Example 8.1. A new observation from a normal distribution. Suppose that the 
data correspond to independent and identically distributed observations having 
a normal distribution of unknown mean $\mu$ and known variance $\sigma_0^2$ and that it is 
required to predict a new observation from the same distribution. Suppose then 
that the new observation $y^*$ has mean $\mu^*$. Then the null hypothesis is tested by 
the standard normal statistic 


$$ \frac{y^* - \bar{y}}{\sigma_0 \sqrt{1 + 1/n}}, \tag{8.1} $$


so that, for example, a level $1 - c$ upper prediction limit for the new observation is 


$$ \bar{y} + k_c^* \sigma_0 \sqrt{1 + 1/n}. \tag{8.2} $$


The difference from the so-called plug-in predictor, $\bar{y} + k_c^* \sigma_0$, in which errors 
of estimating $\mu$ are ignored, is here slight but, especially if a relatively complicated 
model is used as the base for prediction, the plug-in estimate may seriously 
underestimate uncertainty. 
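
As a rough numerical illustration of Example 8.1, the following sketch compares the upper prediction limit that allows for the error in estimating the mean with the plug-in limit. The function name, the data values and the choice c = 0.05 are assumptions made only for this illustration; the multiplier is taken to be the upper c point of the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist, fmean

def upper_prediction_limits(y, sigma0, c):
    """Return (exact, plug_in) level 1 - c upper prediction limits.

    Assumes independent normal observations with known standard
    deviation sigma0, as in Example 8.1.
    """
    n = len(y)
    ybar = fmean(y)
    k = NormalDist().inv_cdf(1.0 - c)          # upper c point of N(0, 1)
    exact = ybar + k * sigma0 * sqrt(1.0 + 1.0 / n)
    plug_in = ybar + k * sigma0                # ignores error in ybar
    return exact, plug_in

data = [9.8, 10.4, 10.1, 9.5, 10.2]            # illustrative values only
exact, plug_in = upper_prediction_limits(data, sigma0=0.5, c=0.05)
print(exact, plug_in)
```

The two limits differ by the factor $\sqrt{1 + 1/n}$ applied to the margin above the sample mean, which is close to one unless n is very small.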


8.2 Decision analysis 


In many contexts data are analysed with a view to reaching a decision, for 
example in a laboratory science context about what experiment to do next. It is, 
of course, always important to keep the objective of an investigation in mind, but 
in most cases statistical analysis is used to guide discussion and interpretation 
rather than as an immediate automatic decision-taking device. 

There are at least three approaches to a more fully decision-oriented discussion. 
Fully Bayesian decision theory supposes available a decision space $D$, 
a utility function $U(d, \theta)$, a prior for $\theta$, data $y$, and a model $f(y; \theta)$ for the data. 
The objective is to choose a decision rule maximizing for each $y$ the expected 
utility averaged over the posterior distribution of $\theta$. Such a rule is called a 
Bayes rule. The arguments for including a prior distribution are strong in that 
a decision rule should take account of all reasonably relevant information and 
not be confined to the question: what do these data plus model assumptions 
tell us? Here utility is defined in such a way that optimizing the expected utility 
is the appropriate criterion for choice. Utility is thus not necessarily expressed 
in terms of money and, even when it is, utility is not in general a linear function 
of monetary reward. In problems in which the consequences of decisions are 
long-lasting the further issue arises of how to discount future gains or losses 
of utility; exponential discounting may in some circumstances be inappropriate 
and some form of power-law discount function may be more suitable. 
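
The definition of a Bayes rule can be sketched for a finite problem as follows. Everything specific here is a hypothetical choice for illustration: the binomial model, the three-point grid of parameter values, the uniform prior, the two decisions and the utility function.

```python
from math import comb

def bayes_decision(y, n, thetas, prior, decisions, utility):
    """Choose the decision maximizing posterior expected utility.

    Assumes y successes in n binomial trials and a prior supported on
    the finite grid `thetas`.
    """
    weights = [p * comb(n, y) * th**y * (1 - th)**(n - y)
               for th, p in zip(thetas, prior)]
    total = sum(weights)
    posterior = [w / total for w in weights]

    def expected_utility(d):
        return sum(q * utility(d, th) for th, q in zip(thetas, posterior))

    return max(decisions, key=expected_utility), posterior

def utility(d, theta):
    # Hypothetical payoff: accepting gains theta - 0.5, rejecting gains nothing.
    return theta - 0.5 if d == "accept" else 0.0

thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
choice_high, _ = bayes_decision(8, 10, thetas, prior, ("accept", "reject"), utility)
choice_low, _ = bayes_decision(2, 10, thetas, prior, ("accept", "reject"), utility)
print(choice_high, choice_low)
```

The rule is evaluated separately for each possible y, so no averaging over the sample space is involved.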

Wald’s treatment of decision theory supposed that a utility function but not 
a prior is available; in some respects this is often an odd specification in that 
the choice of the utility may be at least as contentious as that of the prior. 
Wald showed that the only admissible decision rules, in a natural definition of 
admissibility, are Bayes rules and limits of Bayes rules. This leaves, of course, 
an enormous range of possibilities open; a minimax regret strategy is one 
possibility. 

A third approach is to fall back on, for example, a more literal interpretation 
of the Neyman–Pearson theory of testing hypotheses and to control error rates at 
prechosen levels and to act differently according as y falls in different regions of 
the sample space. While there is substantial arbitrariness in the choice of critical 
rates in such an approach, it does achieve some consistency between different 
similar applications. In that sense it is sometimes used as an essential component 
by regulatory agencies, such as those dealing with new pharmaceutical products. 


8.3 Point estimation 


8.3.1 General remarks 


The most obvious omission from the previous discussion is point estimation, 
i.e., estimation of a parameter of interest without an explicit statement of 
uncertainty. This involves the choice of one particular value when a range of 
possibilities are entirely consistent with the data. This is best regarded as a 
decision problem. With the exception outlined below imposition of constraints 
like unbiasedness, i.e., that the estimate $T$ of $\psi$ satisfy 


$$ E(T; \psi, \lambda) = \psi, \tag{8.3} $$


for all $(\psi, \lambda)$, is somewhat arbitrary. 

Here we consider the point estimate on its own; this is distinct from taking 
it as the basis for forming a pivot, where some arbitrariness in defining the 
estimate can be absorbed into the distribution of the pivot. 

An important generalization is that of an unbiased estimating equation. 
That is, the estimate is defined implicitly as the solution of the equation 


$$ g(Y, \tilde{\psi}) = 0, \tag{8.4} $$

where for all $\theta$ we have 

$$ E\{g(Y, \psi); \theta\} = 0. \tag{8.5} $$


Exact unbiasedness of an estimate, $E(T; \theta) = \psi$, is indeed rarely an important 
property of an estimate. Note, for example, that if $T$ is exactly unbiased 
for $\psi$, then the estimate $h(T)$ will rarely be exactly unbiased for a nonlinear 
function $h(\psi)$. In any case our primary emphasis is on estimation by assessing 
an interval of uncertainty for the parameter of interest $\psi$. Even if this is 
conveniently specified via an implied pivot, a central value and a measure of 
uncertainty, the precise properties of the central value as such are not of direct 
concern. 

There are, however, two somewhat related exceptions. One is where the 
point estimate is an intermediate stage in the analysis of relatively complex 
data. For example, the data may divide naturally into independent sections and 
it may be both convenient and enlightening to first analyse each section on 
its own, reducing it to one or more summarizing statistics. These may then be 
used as input into a model, quite possibly a linear model, to represent their 
systematic variation between sections, or may be pooled by averaging across 
sections. Even quite a small but systematic bias in the individual estimates might 
translate into a serious error in the final conclusion. The second possibility is that 
the conclusions from an analysis may need to be reported in a form such that 
future investigations could use the results in the way just sketched. 


8.3.2 Cramér–Rao lower bound 


A central result in the study of unbiased estimates, the Cramér–Rao inequality, 
shows the minimum variance achievable by an unbiased estimate. For simplicity 
suppose that $\theta$ is a scalar parameter and let $T = T(Y)$ be an estimate with bias 
$b(\theta)$ based on a vector random variable $Y$ with density $f_Y(y; \theta)$. Then 


$$ \int T(y) f_Y(y; \theta)\, dy = \theta + b(\theta). \tag{8.6} $$


Under the usual regularity conditions, differentiate with respect to $\theta$ to give 


$$ \int T(y) U(y; \theta) f_Y(y; \theta)\, dy = 1 + b'(\theta), \tag{8.7} $$


where $U(y; \theta) = \partial \log f_Y(y; \theta)/\partial\theta$ is the gradient of the log likelihood. Because 
this gradient has zero expectation we can write the last equation 


$$ \mathrm{cov}(T, U) = 1 + b'(\theta), \tag{8.8} $$

so that by the Cauchy–Schwarz inequality 

$$ \mathrm{var}(T)\,\mathrm{var}(U) \ge \{1 + b'(\theta)\}^2, \tag{8.9} $$


with equality attained if and only if $T$ is a linear function of $U$; see Note 8.2. 
Now $\mathrm{var}(U)$ is the expected information $i(\theta)$ in the data $y$. Thus, in the particular 
case of unbiased estimates, 

$$ \mathrm{var}(T) \ge 1/i(\theta). \tag{8.10} $$


A corresponding argument in the multiparameter case gives for the variance 
of an unbiased estimate of, say, the first component of $\theta$ the lower bound $i^{11}(\theta)$, 
the (1, 1) element of the inverse of the expected information matrix. 

The results here concern the variance of an estimate. The role of information 
in the previous chapter concerned the asymptotic variance of estimates, i.e., the 
variance of the limiting normal distribution, not in general to be confused with 
the limiting form of the variance of the estimate. 


Example 8.2. Exponential family. In the one-parameter exponential family 
with density $m(y) \exp\{s\phi - k(\phi)\}$ the log likelihood derivative in the canonical 
parameter $\phi$ is linear in $s$ and hence any linear function of $S$ is an unbiased 
estimate of its expectation achieving the lower bound. In particular this applies 
to $S$ itself as an estimate of the moment parameter $\eta = k'(\phi)$. 

Thus if $Y_1, \ldots, Y_n$ are independently normally distributed with mean $\mu$ and 
known variance $\sigma_0^2$ the mean $\bar{Y}$ is unbiased for $\mu$ and its variance achieves 
the lower bound $\sigma_0^2/n$. If, however, the parameter of interest is $\theta = \mu^2$, the 
estimate $\bar{Y}^2 - \sigma_0^2/n$ is unbiased but its variance does not achieve the lower 
bound. More seriously, the estimate is not universally sensible in that it estimates 
an essentially nonnegative parameter by a quantity that may be negative, in fact 
with appreciable probability if $|\mu|$ is very small. The above estimate could be 
replaced by zero in such cases at the cost of unbiasedness. Note, however, that 
would be unwise if the estimate is to be combined linearly with other estimates 
from independent data. 
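
To indicate how appreciable the probability of a negative estimate can be, note that the unbiased estimate of the squared mean is negative exactly when the sample mean is within one standard error of zero. The sketch below evaluates that probability; the function name and the numerical settings are assumptions made for illustration.

```python
from statistics import NormalDist

def prob_negative_estimate(mu, sigma0, n):
    """P(Ybar**2 - sigma0**2 / n < 0) when Ybar is N(mu, sigma0**2 / n).

    The event is |Ybar| < sigma0 / sqrt(n), i.e. |Z + delta| < 1 with
    Z standard normal and delta = sqrt(n) * mu / sigma0.
    """
    delta = mu * n ** 0.5 / sigma0
    z = NormalDist()
    return z.cdf(1.0 - delta) - z.cdf(-1.0 - delta)

p_at_zero = prob_negative_estimate(0.0, 1.0, 10)    # mean exactly zero
p_far = prob_negative_estimate(2.0, 1.0, 25)        # mean far from zero
print(p_at_zero, p_far)
```

When the mean is zero the probability is about 0.68 whatever the sample size, and it becomes negligible only once the mean is several standard errors from zero.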


Example 8.3. Correlation between different estimates. Suppose that $T$ and $T^*$ 
are both exactly (or very nearly) unbiased estimates of the same parameter $\theta$ and 
that $T$ has minimum variance among unbiased estimates. Then for any constant 
$a$ the estimate $aT + (1 - a)T^*$ is also unbiased and has variance 


$$ a^2\,\mathrm{var}(T) + 2a(1 - a)\,\mathrm{cov}(T, T^*) + (1 - a)^2\,\mathrm{var}(T^*). \tag{8.11} $$

This has its minimum with respect to $a$ at $a = 1$ implying that 

$$ \mathrm{cov}(T, T^*) = \mathrm{var}(T), \tag{8.12} $$


as in a sense is obvious, because the estimate $T^*$ may be regarded as $T$ plus an 
independent estimation error of zero mean. 
Alternatively 


$$ \mathrm{corr}(T, T^*) = \{\mathrm{var}(T)/\mathrm{var}(T^*)\}^{1/2}. \tag{8.13} $$

The ratio of variances on the right-hand side may reasonably be called the 
efficiency of $T^*$ relative to $T$, so that the correlation of an inefficient estimate 
with a corresponding minimum variance estimate is the square root of the 
efficiency. For example, it may be shown that the median of a set of $n$ independently 
normally distributed random variables of mean $\mu$ and variance $\sigma_0^2$ is 
an unbiased estimate of $\mu$ with variance approximately $\pi\sigma_0^2/(2n)$. That is, the 
correlation between mean and median is $\sqrt{2}/\sqrt{\pi}$, which is approximately 0.8. 
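
The correlation between mean and median can be checked informally by simulation. The following is a minimal sketch using only the standard library; the sample size, number of replications and seed are arbitrary assumptions, and the simulated value should be expected to agree with the limiting value only approximately.

```python
import random
from math import pi, sqrt
from statistics import fmean, median

def mean_median_correlation(n=101, reps=4000, seed=1):
    """Simulated correlation between sample mean and sample median
    for standard normal samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(fmean(sample))
        medians.append(median(sample))
    mbar, dbar = fmean(means), fmean(medians)
    sxy = sum((a - mbar) * (b - dbar) for a, b in zip(means, medians))
    sxx = sum((a - mbar) ** 2 for a in means)
    syy = sum((b - dbar) ** 2 for b in medians)
    return sxy / sqrt(sxx * syy)

r = mean_median_correlation()
print(r, sqrt(2.0 / pi))    # simulated value and the limiting value
```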


Essentially the same idea can be used to compare the efficiency of tests of 
significance and to study the correlation between two different test statistics 
of the same null hypothesis. For this suppose that $T$ and $T^*$, instead of being 
estimates of $\theta$, are test statistics for a null hypothesis $\theta = \theta_0$ and such that 
they are (approximately) normally distributed with means $\mu(\theta)$ and $\mu^*(\theta)$ and 
variances $\sigma_T^2(\theta)/n$ and $\sigma_{T^*}^2(\theta)/n$. Now locally near the hypothesis we may 
linearize the expectations writing, for example 


$$ \mu(\theta) = \mu(\theta_0) + [d\mu(\theta)/d\theta]_0 (\theta - \theta_0), \tag{8.14} $$


thereby converting the test statistic into a local estimate 


$$ \{T - \mu(\theta_0)\}\{[d\mu(\theta)/d\theta]_0\}^{-1}. \tag{8.15} $$


Near the null hypothesis this estimate has approximate variance 

$$ \{\sigma_T^2(\theta_0)/n\}\{[d\mu(\theta)/d\theta]_0\}^{-2}. \tag{8.16} $$

This leads to the definition of the efficacy of the test as 

$$ \{[d\mu(\theta)/d\theta]_0\}^2 \{\sigma_T^2(\theta_0)\}^{-1}. \tag{8.17} $$


The asymptotic efficiency of T* relative to the most efficient test T is then 
defined as the ratio of the efficacies. Furthermore, the result about the correlation 
between two estimates applies approximately to the two test statistics. The 
approximations involved in this calculation can be formalized via considerations 
of asymptotic theory. 


Example 8.4. The sign test. Suppose that starting from a set of independent 
normally distributed observations of unknown mean $\theta$ and known variance $\sigma_0^2$ 
we consider testing the null hypothesis $\theta = \theta_0$ and wish to compare the standard 
test based on the mean $\bar{Y}$ with the sign test, counting the number of observations 
greater than $\theta_0$. The former leads to the test statistic 


$$ T = \bar{Y} \tag{8.18} $$

with $d\mu(\theta)/d\theta = 1$ and efficacy $1/\sigma_0^2$. The sign test takes $T^*$ to be the proportion 
of observations above $\theta_0$, less 1/2, with binomial variance $1/(4n)$ at the null hypothesis 
and with 


$$ \mu^*(\theta) = \Phi\{(\theta - \theta_0)/\sigma_0\} - 1/2, \tag{8.19} $$


so that the efficacy is $2/(\pi\sigma_0^2)$. Thus the asymptotic efficiency of the sign test 
relative to the test based on the mean is $2/\pi$. Not surprisingly, this is the ratio 
of the asymptotic variances of the mean and median as estimates of $\mu$, as is 
indeed apparent from the way efficacy was defined. 
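
The efficacy of the sign test can be worked out in a few lines. The sketch assumes, as in the example, observations with standard deviation $\sigma_0$, so that the proportion above $\theta_0$ has expectation $\Phi\{(\theta - \theta_0)/\sigma_0\}$; $\phi$ denotes the standard normal density and $n$ times the null variance of the proportion is 1/4.

```latex
\[
\left[\frac{d\mu^*(\theta)}{d\theta}\right]_0
  = \frac{\phi(0)}{\sigma_0}
  = \frac{1}{\sigma_0\sqrt{2\pi}},
\qquad
\text{efficacy of the sign test}
  = \frac{\{1/(\sigma_0\sqrt{2\pi})\}^2}{1/4}
  = \frac{2}{\pi\sigma_0^2},
\]
\[
\text{relative efficiency}
  = \frac{2/(\pi\sigma_0^2)}{1/\sigma_0^2}
  = \frac{2}{\pi} \approx 0.64 .
\]
```

Subtracting the constant 1/2 from the proportion shifts the mean function but leaves its derivative, and hence the efficacy, unchanged.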


8.3.3 Construction of unbiased estimates 


Occasionally it is required to find a function of a sufficient statistic, $S$, that 
is exactly, or very nearly, unbiased for a target parameter $\theta$. Let $V$ be any 
unbiased estimate of $\theta$ and let $S$ be a sufficient statistic for the estimation of $\theta$; 
we appeal to the property of completeness mentioned in Section 4.4. Then 
$T = E(V \mid S)$ is a function of $S$ unbiased for $\theta$ and having minimum variance. 
Note that because the distribution of $Y$ given $S$ does not involve $\theta$ neither does $T$. 
The mean of $T$ is 


$$ E_S E(V \mid S) = E(V) = \theta. \tag{8.20} $$

To find $\mathrm{var}(T)$ we argue that 

$$ \mathrm{var}(V) = \mathrm{var}_S E(V \mid S) + E_S\,\mathrm{var}(V \mid S). \tag{8.21} $$


The first term on the right-hand side is $\mathrm{var}(T)$ and the second term is zero if 
and only if $V$ is a function of $S$. Thus $T$ has minimum variance. Completeness 
of $S$ in the sense that no nonzero function of $S$ can have zero expectation for all $\theta$ shows 
that $T$ is unique, i.e., two different $V$ must lead to the same $T$. 
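
A standard illustration of this construction, not taken from the text, is the estimation of the probability of a zero count from independent Poisson observations. The crude unbiased estimate V is the indicator that the first observation is zero, S is the sample total, and conditioning gives T = (1 - 1/n)^S. The sketch below checks the mean and variance of T numerically; the function names and the truncation point of the Poisson sum are assumptions.

```python
from math import exp, lgamma, log

def poisson_pmf(s, lam):
    # Computed on the log scale to avoid overflow for large s.
    return exp(s * log(lam) - lam - lgamma(s + 1))

def rao_blackwell_moments(theta, n, s_max=400):
    """Mean and variance of T = (1 - 1/n)**S, and variance of V,
    where S is Poisson(n * theta) and V = 1{Y_1 = 0}."""
    lam = n * theta
    t = [(1.0 - 1.0 / n) ** s for s in range(s_max)]
    p = [poisson_pmf(s, lam) for s in range(s_max)]
    mean_t = sum(ti * pr for ti, pr in zip(t, p))
    var_t = sum(ti * ti * pr for ti, pr in zip(t, p)) - mean_t ** 2
    target = exp(-theta)                 # P(Y = 0), the quantity estimated
    var_v = target * (1.0 - target)      # V is a Bernoulli indicator
    return mean_t, var_t, var_v

mean_t, var_t, var_v = rao_blackwell_moments(theta=1.0, n=10)
print(mean_t, var_t, var_v)
```

Both estimates have the same mean, but the variance of T is much smaller than that of V, in line with the decomposition of var(V) above.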

A second possibility is that we have an approximately unbiased estimate and 
wish to reduce its bias further. The easiest way to do this appeals to asymp- 
totic considerations and properly belongs to Chapter 6, but will be dealt with 
briefly here. Commonly the initial estimate will have the form $h(T)$, where 
$h(\cdot)$ is a given function, $h(\theta)$ is the target parameter and $T$ has reasonably 
well-understood properties, in particular its expectation is $\theta$ and its variance is known 
to an adequate approximation. 


Example 8.5. Unbiased estimate of standard deviation. Suppose that 
$Y_1, \ldots, Y_n$ are independently distributed with the same mean $\mu$, the same variance 
$\tau = \sigma^2$ and the same fourth cumulant ratio $\gamma_4 = \kappa_4/\tau^2 = \mu_4/\tau^2 - 3$, 
where $\kappa_4$ and $\mu_4$ are respectively the fourth cumulant and the fourth moment 
about the mean of the distribution of the $Y_k$; note that the notation $\gamma_4$ is preferred 
to the more commonly used notation $\gamma_2$ or the older $\beta_2 - 3$. If $s^2$ is the standard 
estimate of variance, namely $\sum (Y_k - \bar{Y})^2/(n - 1)$, direct calculation with 
expectations shows that $s^2$ is exactly unbiased for $\tau$ and that approximately for 
large $n$ 


$$ \mathrm{var}(s^2) = 2(1 + \gamma_4/2)\tau^2/n. \tag{8.22} $$


To obtain an approximately unbiased estimate of $\sigma$, write $T = s^2$ and 
consider $\sqrt{T}$. We expand about $\tau$ to obtain 


$$ \sqrt{T} = \sqrt{\tau}\{1 + (T - \tau)/\tau\}^{1/2} = \sqrt{\tau}\{1 + (T - \tau)/(2\tau) - (T - \tau)^2/(8\tau^2) + \cdots\}. \tag{8.23} $$


Now take expectations, neglecting higher-order terms. Because $E(T - \tau) = 0$ 
and $E(T - \tau)^2 = \mathrm{var}(s^2)$ is given by (8.22), there results 


$$ E(\sqrt{T}) = \sqrt{\tau}\{1 - (1 + \gamma_4/2)/(4n)\}, \tag{8.24} $$

suggesting, recalling that $s = \sqrt{T}$, the revised estimate 

$$ s\{1 + (1 + \gamma_4/2)/(4n)\}. \tag{8.25} $$


This requires either some approximate knowledge of the fourth cumulant ratio 
$\gamma_4$ or its estimation, although the latter requires a large amount of data in the 
light of the instability of estimates of fourth cumulants. 

For a normal originating distribution the correction factor is to this order 
$\{1 + 1/(4n)\}$. Because in the normal case the distribution of $T$ is proportional 
to chi-squared with $n - 1$ degrees of freedom, it follows that 


$$ E(\sqrt{T}) = \sqrt{\tau}\, \frac{\sqrt{2}\,\Gamma(n/2)}{\sqrt{n - 1}\,\Gamma(n/2 - 1/2)}, \tag{8.26} $$


from which an exactly unbiased estimate can be constructed. The previous 
argument essentially, in this case, amounts to a Laplace expansion of the ratio 
of gamma functions. 
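
For the normal case the exact gamma-function ratio can be compared numerically with the first-order expansion. Since the variance of the usual variance estimate is approximately $2\tau^2/n$ for normal data, the quadratic term of the expansion of the square root gives a ratio of roughly 1 - 1/(4n). The sketch below makes the comparison; the function names and the sample sizes are assumptions for illustration.

```python
from math import exp, lgamma, sqrt

def exact_mean_ratio(n):
    """E(s) / sigma for n independent normal observations, using the
    chi-squared distribution of (n - 1) s**2 / sigma**2."""
    log_ratio = lgamma(n / 2.0) - lgamma((n - 1) / 2.0)
    return sqrt(2.0 / (n - 1)) * exp(log_ratio)

def first_order_ratio(n):
    """Leading terms of the expansion: 1 - var(s**2) / (8 tau**2)."""
    return 1.0 - 1.0 / (4.0 * n)

for n in (5, 10, 50):
    print(n, exact_mean_ratio(n), first_order_ratio(n))
```

Dividing s by `exact_mean_ratio(n)` gives the exactly unbiased estimate in the normal case; the first-order version differs from it by a term of order 1/n squared.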


This argument has been given in some detail because it illustrates a simple 
general method; if the originating estimate is a function of several component 
random variables a multivariable expansion will be needed. It is to be stressed 
that the provision of exactly (or very nearly) unbiased estimates is rarely, if ever, 
important in its own right, but it may be part of a pathway to more important 
goals. 

As a footnote to the last example, the discussion raises the relevance of the 
stress laid in elementary discussions of estimation on the importance of dividing 
a sum of squares of deviations from the mean by $n - 1$ and not by $n$ in order to 
estimate variance. There are two justifications. One is that it eases the merging 
of estimates of variance by pooling small samples from different sources; the 
pooled sums of squares are divided by the pooled degrees of freedom. The 
other reason is more one of convention. By always dividing a residual sum of 
squares from a linear model by the residual degrees of freedom, i.e., by the rank 
of the relevant quadratic form, comparability with, for example, the use of the 
Student t distribution for inference about a regression coefficient is achieved. 
The argument that an individual estimate is unbiased has little force in that 
the sampling error is much greater than any bias induced by dividing by, say, 
$n$ rather than by $n - 1$. In general for a single estimate adjustments that are small 
compared with the sampling error are unimportant; when several or many such 
estimates are involved the position is different. 

Note finally that while comparison of asymptotic variances of alternative 
estimates is essentially a comparison of confidence intervals, and is invariant 
under nonlinear transformation of the target parameter, such stability of inter- 
pretation does not hold for small-sample comparisons either of mean squared 
error or of variances of bias-adjusted estimates. 


8.4 Non-likelihood-based methods 


8.4.1 General remarks 


A limitation, at least superficially, on the discussion in Chapters 1–7 has been 
its strong emphasis on procedures derived from likelihood and gaining both 
intuitive appeal and formal strength thereby. While, especially with modern 
computational developments, tackling new problems via some form of like- 
lihood is often very attractive, nevertheless there are many contexts in which 
some less formal approach is sensible. This and the following section therefore 
consider methods in which likelihood is not used, at least not explicitly. 

Likelihood plays two roles in the earlier discussion. One is to indicate 
which aspects of the data are relevant and the other, especially in the Bayesian 
approach, is to show directly how to extract interpretable inference. In some of 
the situations to be outlined, it may be that some other considerations indicate 
summarizing features of the data. Once that step has been taken likelihood- 
based ideas may be used as if those summarizing features were the only 
data available. A possible loss of efficiency can sometimes then be assessed 
by, for example, comparing the information matrix from the reduced data 
with that from the originating data. Most of the previous discussion is then 
available. 

Some of the reasons for not using the likelihood of the full data are as 
follows: 


• obtaining the likelihood in useful form may be either impossible or 
  prohibitively time-consuming; 

• it may be desired to express a dependency via a smooth curve or surface not 
  of prespecified parametric form but obtained via some kind of smoothing 
  method; 

• it may be desired to make precision assessment and significance testing by 
  cautious methods making no parametric assumption about the form of 
  probability distributions of error; 

• there may be a need to use procedures relatively insensitive to particular 
  kinds of contamination of the data; 

• it may be wise to replace complex computations with some more 
  transparent procedure, at least as a preliminary or for reassurance; 

• there may be a need for very simple procedures, for example for 
  computational reasons, although computational savings as such are less 
  commonly important than was the case even a few years ago. 


8.4.2 Alternatives to the log likelihood 


We begin with a relatively minor modification in which the log likelihood $l(\theta)$ is 
replaced by a function intended to be essentially equivalent to it but in a sense 
simpler to handle. To the order of asymptotic theory considered here $l(\theta)$ could 
be replaced by any function having its maximum at $\hat{\theta} + O_p(1/n)$ and having 
second derivatives at the maximum that are $-\hat{\jmath}\{1 + O_p(1/\sqrt{n})\}$. There are many 
such functions leading to estimates, confidence limits and tests that are all in 
a sense approximately equivalent to those based on $l(\theta)$. That is, the estimates 
typically differ by $O_p(1/n)$, a small difference when compared with the standard 
errors which are $O_p(1/\sqrt{n})$. 

In fact a number of considerations, notably behaviour in mildly anomalous 
cases and especially properties derived from higher-order expansions, suggest 
that log likelihood is the preferred criterion. Sometimes, however, it is useful 
to consider other possibilities and we now consider one such, reduction to the 
method of least squares. 

For this we suppose that we have $m$ random variables $Y_1, \ldots, Y_m$ and that 
each is a condensation of more detailed data, each $Y_k$ having a large information 
content, i.e., $O(n)$. A fuller notation would replace $Y_k$, say, by $Y_k^{(n)}$ to emphasize 
that the asymptotics we consider involve not increasing $m$ but rather increasing 
internal replication within each $Y_k$. 

Suppose that there is a transformation from $Y_k$ to $h(Y_k)$ such that 
asymptotically in $n$ the transformed variable is normally distributed with mean 
a simple function of $\theta$ and variance $\tau_k$ which can be estimated by $v_k$ such that 
$v_k/\tau_k = 1 + O_p(1/\sqrt{n})$. Although the restriction is not essential, we shall take 
the simple function of $\theta$ to be linear, and using the notation of the linear model, 
write the $m \times 1$ vector of asymptotic means in the form 


$$ E\{h(Y)\} = z\beta, \tag{8.27} $$


where $z$ is an $m \times p$ matrix of constants of full rank $p < m$ and $\beta$ is a $p \times 1$ vector 
of parameters and $\mathrm{cov}\{h(Y)\}$ is consistently estimated by $\mathrm{diag}(v_k)$. This is to 
be taken as meaning that the log likelihood from the formal linear model just 
specified agrees with that from whatever model was originally specified for $Y$ 
with error $O_p(1/\sqrt{n})$. Note for this that the log likelihood for $h(Y_k)$ differs by 
at most a constant from that for $Y_k$ and that typically a random variable whose 
asymptotic distribution is standard normal has a distribution differing from the 
standard normal by $O(1/\sqrt{n})$. In this argument, as noted above, the asymptotics, 
and hence the approximations, derive from internal replication within each $Y_k$, 
not from notional increases in $m$. 

Application of standard weighted least squares methods to (8.27) may be 
called locally linearized least squares with empirically estimated weights. 


Example 8.6. Summarization of binary risk comparisons. Suppose that the 
primary data are binary outcomes for individuals who have one of two possible 
levels of a risk factor or treatment. In the $k$th investigation for $k = 1, \ldots, m$, 
there are $n_{kt}$ individuals in risk category $t$ for $t = 0, 1$ of whom $r_{kt}$ show a 
positive response. If $\pi_{kt}$ denotes the corresponding probability of a positive 
response, we have previously in Example 4.5 considered the linear logistic 
model in which 


$$ \log\{\pi_{kt}/(1 - \pi_{kt})\} = \alpha_k + \Delta t. \tag{8.28} $$


There are arbitrary differences between levels of $k$ but the logistic difference 
between $t = 0$ and $t = 1$ is constant. 

We take $Y_k$ to be the collection $(n_{k0}, R_{k0}, n_{k1}, R_{k1})$, where the $R_{kt}$ are binomial 
random variables. Let 


$$ h(Y_k) = \log \frac{R_{k1}(n_{k0} - R_{k0})}{R_{k0}(n_{k1} - R_{k1})}. \tag{8.29} $$


This may be called the empirical logistic difference between the two conditions 
because it is a difference of two observed log odds, each of which is an empirical 
logistic transform. 

Then asymptotically $h(Y_k)$ is normally distributed with mean $\Delta$ and variance 
estimated by 


$$ v_k = 1/R_{k1} + 1/R_{k0} + 1/(n_{k1} - R_{k1}) + 1/(n_{k0} - R_{k0}). \tag{8.30} $$


Weighted least squares gives for the estimation of the single parameter $\Delta$ the 
estimate and asymptotic variance 


$$ \tilde{\Delta} = \frac{\sum h(Y_k) v_k^{-1}}{\sum v_k^{-1}} \tag{8.31} $$

and 

$$ 1/\sum v_k^{-1}. \tag{8.32} $$

Moreover, the weighted residual sum of squares, namely 

$$ \sum \{h(Y_k) - \tilde{\Delta}\}^2/v_k, \tag{8.33} $$


has an asymptotic chi-squared distribution with $m - 1$ degrees of freedom and is 
the asymptotic equivalent to the likelihood ratio and other tests of the constancy 
of the logistic difference for all $k$. 
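
The calculations of this example are simple enough to sketch directly. In the code below each investigation is summarized by a tuple (n0, r0, n1, r1); the function name, the argument layout and the counts in the usage lines are hypothetical, and every cell count is assumed to be positive so that the logarithms and reciprocals exist.

```python
from math import log

def pooled_logistic_difference(tables):
    """Weighted least squares pooling of empirical logistic differences.

    tables: list of (n0, r0, n1, r1), the numbers of individuals and of
    positive responses at levels t = 0 and t = 1 in each investigation.
    Returns the pooled estimate, its asymptotic variance and the
    weighted residual sum of squares (chi-squared on m - 1 d.f.).
    """
    h, v = [], []
    for n0, r0, n1, r1 in tables:
        h.append(log(r1 * (n0 - r0) / (r0 * (n1 - r1))))
        v.append(1 / r1 + 1 / r0 + 1 / (n1 - r1) + 1 / (n0 - r0))
    w = [1 / vk for vk in v]                      # empirical weights
    delta = sum(hk * wk for hk, wk in zip(h, w)) / sum(w)
    var_delta = 1 / sum(w)
    chi2 = sum((hk - delta) ** 2 * wk for hk, wk in zip(h, w))
    return delta, var_delta, chi2

# Hypothetical counts, chosen so that both studies have the same odds ratio.
delta, var_delta, chi2 = pooled_logistic_difference(
    [(100, 20, 100, 40), (50, 10, 50, 20)]
)
print(delta, var_delta, chi2)
```

No iteration is involved, which is the source of the transparency of the method.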


The following further points arise. 


• Other functions than the logistic difference could be used, for example the 
  estimated difference between probabilities themselves. 

• If significant heterogeneity is found via the chi-squared test, the next step is 
  to find relevant explanatory variables characterizing the different $k$ and to 
  insert them as explanatory variables in the linear model. 

• If it is suspected that the binomial variation underlying the calculation of $v_k$ 
  underestimates the variability, inflation factors can be applied directly to 
  the $v_k$. 

• If the data emerging for each $k$ and encapsulated in $Y_k$ include adjustments 
  for additional explanatory variables, the argument is unchanged, except 
  possibly for minor changes to $v_k$ to allow for the additional variance 
  attached to the adjustments. 

• The standard subsidiary methods of analysis associated with the method of 
  least squares, for example those based on residuals, are available. 


Any advantage to this method as compared with direct use of the log like- 
lihood probably lies more in its transparency than in any computational gain 
from the avoidance of iterative calculations. 


8.4.3 Complex models 


One way of fitting a complex model, for example a stochastic model of temporal 
or spatial-temporal development, is to find which properties of the model can 
be evaluated analytically in reasonably tractable form. For a model with a 
d-dimensional parameter, the simplest procedure is then to equate d functions of 
the data to their analytical analogues and to solve the resulting nonlinear equa- 
tions numerically. If there is a choice of criteria, ones judged on some intuitive 
basis as likely to be distributed independently are chosen. For example, from 
data on daily rainfall one might fit a four-parameter stochastic model by equat- 
ing observed and theoretical values of the mean and variance, the proportion of 
dry days and the lag one serial correlation coefficient. 

If s further features can be evaluated analytically they can be used for an 
informal test of adequacy, parallelling informally the sufficiency decomposi- 
tion of the data used in simpler problems. Alternatively, the d + s features could 
be fitted by the d parameters using some form of weighted nonlinear general- 
ized least squares. For this a covariance matrix would have to be assumed or 
estimated by simulation after a preliminary fit. 

Such methods lead to what may be called generalized method of moments 
estimates, named after the parallel with a historically important method of 
fitting parametric frequency curves to data in which for d parameters the first 
d moments are equated to their theoretical values. 
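
The classical method of moments can be illustrated by a two-parameter fit. The gamma family is used here purely as an assumed example: its mean is shape times scale and its variance is shape times scale squared, so equating these to the sample mean and variance gives the estimates in closed form. The function name and the data are hypothetical.

```python
from statistics import fmean, pvariance

def gamma_moment_fit(data):
    """Equate the first two sample moments to those of a gamma
    distribution with mean shape * scale and variance shape * scale**2."""
    m = fmean(data)
    v = pvariance(data, mu=m)     # second moment about the sample mean
    shape = m * m / v
    scale = v / m
    return shape, scale

shape, scale = gamma_moment_fit([1, 2, 3, 4, 5])   # illustrative values only
print(shape, scale)
```

For models in which the moments are not available in closed form the same equations would be solved numerically, as described in the text.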

Especially if the features used to fit the model are important aspects of the 
system, such methods, although seemingly rather empirical compared with 
those developed earlier in the book, may have appreciable appeal in that even if 
the model is seriously defective at least the fitted model reproduces important 
properties and might, for example, therefore be a suitable base for simulation 
of further properties. 

An important feature in the fitting of relatively complex models, especially 
stochastic models of temporal or spatial-temporal data, is that it may be recog- 
nized that certain features of the data are poorly represented by the model, 
but this may be considered largely irrelevant, at least provisionally. Such fea- 
tures might dominate a formal likelihood analysis of the full data, which would 
therefore be inappropriate. 

Such issues are particularly potent for stochastic models developed in con- 
tinuous time. For formal mathematical purposes, although not for rigorous 
general development, continuous time processes are often easier to deal with 
than the corresponding discrete time models. Whenever a continuous time 
model is fitted to empirical data it is usually important to consider critically 
the range of time scales over which the behaviour of the model is likely to be a 
reasonable approximation to the real system. 




For example, the sample paths of the stochastic process might contain 
piece-wise deterministic sections which might allow exact determination of 
certain parameters if taken literally. Another somewhat similar issue that may 
arise especially with processes in continuous time is that the very short time 
scale behaviour of the model may be unrealistic, without prejudicing properties 
over wider time spans. 


Example 8.7. Brownian motion. Suppose that the model is that the process
$\{Y(t)\}$ is Brownian motion with zero drift and unknown variance parameter
(diffusion coefficient) $\sigma^2$. That is, increments of the process over disjoint
intervals are assumed independently normally distributed with, for any
$(t, t + h)$, the increment $Y(t + h) - Y(t)$ having zero mean and variance
$\sigma^2 h$. If we observe $Y(t)$ over the interval $(0, t_0)$, we may divide the
interval into $t_0/h$ nonoverlapping subintervals each of length $h$ and from
each obtain an unbiased estimate of $\sigma^2$ of the form

$$\{Y(kh + h) - Y(kh)\}^2/h. \tag{8.34}$$

There being $t_0/h$ independent such estimates, we obtain on averaging an
estimate with $t_0/h$ degrees of freedom. On letting $h \to 0$ it follows that for
any $t_0$, no matter how small, $\sigma^2$ can be estimated with arbitrarily small
estimation error. Even though Brownian motion may be an excellent model for many
purposes, the conclusion is unrealistic, either because the model does not hold
over very small time intervals or because the measurement process distorts local
behaviour.
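
The limiting argument can be made explicit by a short calculation under the stated model. The symbol $\hat\sigma^2_h$ for the averaged estimate and the assumption that $m = t_0/h$ is an integer are introduced here only for illustration.

```latex
\begin{align*}
\hat\sigma^2_h &= \frac{1}{m}\sum_{k=0}^{m-1}\frac{\{Y(kh + h) - Y(kh)\}^2}{h},
  \qquad m = t_0/h, \\
\frac{m\,\hat\sigma^2_h}{\sigma^2} &\sim \chi^2_m, \qquad
  E(\hat\sigma^2_h) = \sigma^2, \\
\mathrm{var}(\hat\sigma^2_h) &= \frac{2\sigma^4}{m} = \frac{2\sigma^4 h}{t_0}
  \to 0 \quad (h \to 0).
\end{align*}
```

So, if the model is taken literally, the variance of the estimate vanishes for fixed $t_0$ as the subdivision becomes finer.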


8.4.4 Use of notionally inefficient statistics 


As noted in the preliminary remarks, there are various reasons for using other 
than formally efficient procedures under a tightly specified model. In particular 
there is an extensive literature on robust estimation, almost all concentrating 
on controlling the influence of observations extreme in some sense. Deriva- 
tion of such estimates can be done on an intuitive basis or by specifying a 
new model typically with modified tail behaviour. Estimates can in many cases 
be compared via their asymptotic variances; a crucial issue in these discus- 
sions is the preservation of the subject-matter interpretation of the parameter of 
interest. 

More generally, frequentist-type analyses lend themselves to the theoretical 
comparison of alternative methods of analysis and to the study of the sensitiv- 
ity of methods to misspecification errors. This may be done by mathematical 
analysis or by computer simulation or often by a combination. It is common to 
present the conclusions of such studies by one or more of:

• the bias, variance and possibly mean square error of alternative point
  estimates;

• the coverage properties, at a given level, of confidence intervals;

• the probability of rejecting correct null hypotheses, and of power, i.e., the
  probability of detecting specified departures, both types of comparison
  being at prespecified levels, for example 5 per cent or 1 per cent values of p.
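
A minimal sketch of such a simulation study follows. It compares the sample mean and the sample median as estimates of a normal mean and records the coverage of a known-variance interval based on the mean. The normal model, the sample size, the number of replicates and all names are assumptions chosen for illustration, and the third kind of summary (rejection probabilities) is not covered.

```python
import random
from statistics import NormalDist, fmean, median, pvariance


def simulate(n=20, reps=2000, mu=0.0, sigma=1.0, level=0.95, seed=1):
    """Summarise mean and median as estimates of mu under normal sampling."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided normal quantile
    half_width = z * sigma / n ** 0.5          # interval with sigma known
    means, medians, covered = [], [], 0
    for _ in range(reps):
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = fmean(y)
        means.append(ybar)
        medians.append(median(y))
        covered += abs(ybar - mu) <= half_width

    def summary(estimates):
        bias = fmean(estimates) - mu
        var = pvariance(estimates)
        return {"bias": bias, "variance": var, "mse": var + bias ** 2}

    return {"mean": summary(means), "median": summary(medians),
            "coverage": covered / reps}


result = simulate()
```

Under these assumptions the median should show the larger variance and the coverage should be close to the nominal level, but the particular figures depend on the seed and the number of replicates.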


These are, however, quite often not the most revealing summaries of, for 
example, extensive simulation studies. For instance, coverage properties may 
better be studied via the whole distributions of pivots based on alternative ana- 
lyses and significance tests similarly by comparing distributions of the random 
variable P. 

In terms of the analysis of particular sets of data the issue is whether spe- 
cific conclusions are vulnerable to underlying assumptions and sensitivity 
assessment has to be appropriately conditional. 

Parallel Bayesian analyses involve the sensitivity of a posterior distribution 
to perturbation of either the likelihood or the prior and again can be studied 
analytically or purely numerically. 


Notes 8 


Section 8.1. Some accounts of statistical analysis put much more emphasis than 
has been done in this book on prediction as contrasted with parameter estim- 
ation. See, for example, Geisser (1993) and Aitchison and Dunsmore (1975). 
Many applications have an element of or ultimate objective of prediction. Often, 
although certainly not always, this is best preceded by an attempt at analysis 
and understanding. For a discussion of prediction from a frequentist viewpoint, 
see Lawless and Fredette (2005). An alternative approach is to develop a notion 
of predictive likelihood. For one such discussion, see Butler (1986). 

A series of papers summarized and substantially developed by Lee and Nelder 
(2006) have defined a notion of h-likelihood in the context of models with 
several layers of error and have applied it to give a unified treatment of a 
very general class of models with random elements in both systematic and 
variance structure. Some controversy surrounds the properties that maximizing 
the h-likelihood confers. 


Section 8.2. An element of decision-making was certainly present in the early 
work of Neyman and Pearson, a Bayesian view being explicit in Neyman and 
Pearson (1933), and the decision-making aspect of their work was extended 
and encapsulated in the attempt of Wald (1950) to formulate all statistical 
problems as ones of decision-making. He assumed the availability of a util- 
ity (or loss) function but no prior probability distribution, so that Bayesian 
decision rules played a largely technical role in his analysis. Most current treat- 
ments of decision theory are fully Bayesian, in principle at least. There is a very 
strong connection with economic theory which does, however, tend to assume, 
possibly after a largely ritual hesitation, that utility is linear in money. 


Section 8.3. Emphasis has not been placed on point estimation, especially 
perhaps on point estimates designed to minimize mean squared error. The reason 
is partly that squared error is often not a plausible measure of loss, especially in 
asymmetric situations, but more importantly the specification of isolated point 
estimates without any indication of their precision is essentially a decision 
problem and should be treated as such. In most applications in which point 
estimates are reported, either they are an intermediate step in analysis or they 
are essentially the basis of a pivot for inference. In the latter case there is some 
allowable arbitrariness in the specification. 

The Cauchy–Schwarz inequality in the present context is proved by noting 
that, except in degenerate cases, for all constants a, b the random variable 
aT + bU has strictly positive variance. The quadratic equation for, say, b/a 
obtained by equating the variance to zero thus cannot have a real solution. 
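
The argument can be written out as follows; the symbol $x = b/a$ is introduced here only for the display.

```latex
\begin{align*}
\mathrm{var}(aT + bU) &= a^2\,\mathrm{var}(T) + 2ab\,\mathrm{cov}(T, U)
  + b^2\,\mathrm{var}(U) > 0, \\
\text{so, with } x = b/a, \quad
  &\mathrm{var}(U)\,x^2 + 2\,\mathrm{cov}(T, U)\,x + \mathrm{var}(T) = 0
  \text{ has no real root,} \\
\text{i.e.} \quad &\{\mathrm{cov}(T, U)\}^2 < \mathrm{var}(T)\,\mathrm{var}(U).
\end{align*}
```

The final line is the condition that the discriminant of the quadratic is negative.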

The elegant procedure for constructing unbiased estimates via an arbitrary 
unbiased estimate and a sufficient statistic is often called Rao–Blackwellization, 
after its originators C. R. Rao and D. Blackwell. A collection of papers on 
estimating equations was edited by Godambe (1991) and includes a review of 
his own contributions. An especially influential contribution is that of Liang 
and Zeger (1986). 

Example 8.3 follows an early discussion of Fisher (1925a). The notion of 
asymptotic relative efficiency and its relation to that of tests was introduced 
essentially in the form given here by Cochran (1937), although it was set out in 
more formal detail in lecture notes by E. J. G. Pitman, after whom it is usually 
named. There are other ideas of relative efficiency related to more extreme tail 
behaviour. 


Section 8.4. Estimates with the same asymptotic distribution as the maximum 
likelihood estimate to the first order of asymptotic theory are called BAN (best 
asymptotically normal) or if it is desired to stress some regularity conditions 
RBAN (regular best asymptotically normal). The idea of forming summarizing 
statistics on the basis of some mixture of judgement and ease of interpretation 
and then using these as the base for a second more formal stage of analysis has 
a long history, in particular in the context of longitudinal data. In econometrics it 
has been developed as the generalized method of moments. For an account with 
examples under the name ‘the indirect method’, see Jiang and Turnbull (2004). 
The asymptotic variance stated for the log odds is obtained by using local 
linearization (the so-called delta method) to show that, for any function $h(R/n)$ 
differentiable at $\mu$, asymptotically $\mathrm{var}\{h(R/n)\} = \{h'(\mu)\}^2\,\mathrm{var}(R/n)$, 
where $\mu = E(R/n)$. 

The more nearly linear the function $h(.)$ near $\mu$ the better the approximation. 
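
For the log odds the calculation runs as follows, on the assumption (made here for illustration) that $R$ is binomially distributed with index $n$ and probability $\mu$.

```latex
\begin{align*}
h(p) &= \log\{p/(1 - p)\}, \qquad h'(\mu) = \frac{1}{\mu(1 - \mu)}, \\
\mathrm{var}(R/n) &= \frac{\mu(1 - \mu)}{n}, \\
\mathrm{var}\{h(R/n)\} &\approx \{h'(\mu)\}^2\,\mathrm{var}(R/n)
  = \frac{1}{n\mu(1 - \mu)}.
\end{align*}
```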


9 


Randomization-based analysis 


Summary. A different approach to statistical inference is outlined based not on 
a probabilistic model of the data-generating process but on the randomization 
used in study design. The implications of this are developed in simple cases, 
first for sampling and then for the design of experiments. 


9.1 General remarks 


The discussion throughout the book so far rests centrally on the notion of a 
probability model for the data under analysis. Such a model represents, often 
in considerably idealized form, the data-generating process. The parameters 
of interest are intended to capture important and interpretable features of that 
generating process, separated from the accidental features of the particular data. 
That is, the probability model is a model of physically generated variability, 
of course using the word ‘physical’ in some broad sense. This whole approach 
may be called model-based. 

In some contexts of sampling existing populations and of experimental design 
there is a different approach in which the probability calculations are based 
on the randomization used by the investigator in the planning phases of the 
investigation. We call this a design-based formulation. 

Fortunately there is a close similarity between the methods of analysis emer- 
ging from the two approaches. The more important differences between them 
concern interpretation of the conclusions. Despite the close similarities it seems 
not to be possible to merge a theory of the purely design-based approach seam- 
lessly into the theory developed earlier in the book. This is essentially because 
of the absence of a viable notion of a likelihood for the haphazard component 
of the variability within a purely design-based development. 
These points will be illustrated in terms of two very simple examples, one of 
sampling and one of experimental design. 


9.2 Sampling a finite population 


9.2.1 First and second moment theory 


Suppose that we have a well-defined population of $N$ labelled individuals and 
that the $s$th member of the population has a real-valued property $\eta_s$. The 
quantity of interest is assumed to be the finite population mean 
$m_\eta = \Sigma\eta_s/N$. The initial objective is the estimation of $m_\eta$ 
together with some assessment of the precision of the resulting estimate. More 
analytical and comparative aspects will not be considered here. 

To estimate $m_\eta$ suppose that a sample of $n$ individuals is chosen at random 
without replacement from the population on the basis of the labels. That is, some 
impersonal sampling device is used that ensures that all $N!/\{n!(N - n)!\}$ 
distinct possible samples have equal probability of selection. Such simple random 
sampling without replacement would rarely be used directly in applications, but 
is the basis of many more elaborate and realistic procedures. For convenience 
of exposition we suppose that the order of the observations $y_1, \ldots, y_n$ is also 
randomized. 

Define an indicator variable $I_{ks}$ to be 1 if the $k$th member of the sample is the 
$s$th member of the population. Then in virtue of the sampling method, not in 
virtue of an assumption about the structure of the population, the distribution 
of the $I_{ks}$ is such that 

$$E_R(I_{ks}) = P_R(I_{ks} = 1) = 1/N, \tag{9.1}$$

and, because the sampling is without replacement, 

$$E_R(I_{ks}I_{lt}) = (1 - \delta_{kl})(1 - \delta_{st})/\{N(N - 1)\} + \delta_{kl}\delta_{st}/N, \tag{9.2}$$

where $\delta_{kl}$ is the Kronecker delta symbol equal to 1 if $k = l$ and 0 otherwise. The 
suffix $R$ is to stress that the probability measure is derived from the sampling 
randomization. More complicated methods of sampling would be specified by 
the properties of these indicator random variables. 

It is now possible, by direct if sometimes tedious calculation from these and 
similar higher-order specifications of the distribution, to derive the moments of 
the sample mean and indeed any polynomial function of the sample values. For 
this we note that, for example, the sample mean can be written 

$$\bar y = \Sigma_{k,s} I_{ks}\eta_s/n. \tag{9.3}$$

It follows immediately from (9.1) that 

$$E_R(\bar y) = m_\eta, \tag{9.4}$$

so that the sample mean is unbiased in its randomization distribution. 
Similarly 

$$\mathrm{var}_R(\bar y) = \Sigma_{k,s}\mathrm{var}(I_{ks})\eta_s^2/n^2 + 2\Sigma_{k>l,s>t}\mathrm{cov}(I_{ks}, I_{lt})\eta_s\eta_t/n^2. \tag{9.5}$$

It follows from (9.2) that 

$$\mathrm{var}_R(\bar y) = (1 - f)v_{\eta\eta}/n, \tag{9.6}$$

where the second-moment variability of the finite population is represented by 

$$v_{\eta\eta} = \Sigma(\eta_s - m_\eta)^2/(N - 1), \tag{9.7}$$

sometimes inadvisably called the finite population variance, and $f = n/N$ is 
the sampling fraction. 

Thus we have a simple generalization of the formula for the variance of a 
sample mean of independent and identically distributed random variables. Quite 
often the proportion $f$ of individuals sampled is small and then the factor $1 - f$, 
called the finite population correction, can be omitted. 
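
For a small population the unbiasedness result (9.4) and the variance formula (9.6) can be checked by listing every possible sample. The sketch below does this in exact rational arithmetic; the population values and the function name `randomization_moments` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def randomization_moments(eta, n):
    """Exact mean and variance of the sample mean over all equally likely samples."""
    means = [Fraction(sum(sample), n) for sample in combinations(eta, n)]
    e_r = sum(means) / len(means)
    var_r = sum((m - e_r) ** 2 for m in means) / len(means)
    return e_r, var_r


eta = [2, 5, 3, 8, 7, 11]  # hypothetical population values, N = 6
N, n = len(eta), 3
m_eta = Fraction(sum(eta), N)                       # finite population mean
v = sum((x - m_eta) ** 2 for x in eta) / (N - 1)    # v as defined in (9.7)
f = Fraction(n, N)                                  # sampling fraction

e_r, var_r = randomization_moments(eta, n)
# e_r equals m_eta, and var_r equals (1 - f) * v / n, as in (9.4) and (9.6).
```

Because only the randomization is used, the agreement is exact for any set of population values, not just the ones chosen here.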

A similar argument shows that if $s^2 = \Sigma(y_k - \bar y)^2/(n - 1)$, then 

$$E_R(s^2) = v_{\eta\eta}, \tag{9.8}$$

so that the pivot 

$$\frac{m_\eta - \bar y}{\{s^2(1 - f)/n\}^{1/2}} \tag{9.9}$$

has the form of a random variable of zero mean divided by an estimate of its 
standard deviation. A version of the Central Limit Theorem is available for 
this situation, so that asymptotically confidence limits are available for $m_\eta$ by 
pivotal inversion. 

A special case where the discussion can be taken further is when the population 
values $\eta_s$ are binary, say 0 or 1. Then the number of sampled individuals 
having value 1 has a hypergeometric distribution and the target population value 
is the number of 1s in the population, a defining parameter of the hypergeometric 
distribution, and in principle in this special case design-based inference 
is equivalent, formally at least, to parametric inference. 


9.2.2 Discussion 


Design-based analysis leads to conclusions about the finite population mean 
totally free of assumptions about the structure of the variation in the population 
and subject, for interval estimation, only to the usually mild approximation 
of normality for the distribution of the pivot. Of course, in practical sampling 
problems there are many complications; we have ignored the possibility that 
supplementary information about the population might point to conditioning 
the sampling distribution on some features or would have indicated a more 
efficient mode of sampling. 

We have already noted that for binary features assumed to vary randomly 
between individuals essentially identical conclusions emerge with no special 
assumption about the sampling procedure but the extremely strong assumption 
that the population features correspond to independent and identically distrib- 
uted random variables. We call such an approach model-based or equivalently 
based on a superpopulation model. That is, the finite population under study is 
regarded as itself a sample from a larger universe, usually hypothetical. 

A very similar conclusion emerges in the Gaussian case. For this, we may 
assume $\eta_1, \ldots, \eta_N$ are independently normally distributed with mean $\mu$ and 
variance $\sigma^2$. Estimating the finite population mean is essentially equivalent to 
estimating the mean $\bar y^*$ of the unobserved individuals and this is a prediction 
problem of the type discussed in Section 8.1. That is, the target parameter is 
an average of a fraction $f$ known exactly, the sample, and an unobserved part, 
a fraction $1 - f$. The predictive pivot is, with $\sigma^2$ known, the normally distributed 
quantity 

$$(\bar y^* - \bar y)/[\sigma\{1/n + 1/(N - n)\}^{1/2}] \tag{9.10}$$

and when $\sigma^2$ is unknown an estimated variance is used instead. The pivotal 
distributions are respectively the standard normal and the Student $t$ distribution 
with $n - 1$ degrees of freedom. Except for the sharper character of the 
distributional result, the exact distributional result contrasted with asymptotic 
normality, this is the same as the design-based inference. 
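
The agreement with the design-based variance (9.6) can be seen from a short calculation, sketched here under the Gaussian superpopulation model just described.

```latex
\begin{align*}
m_\eta &= f\bar y + (1 - f)\bar y^*
  \quad\Longrightarrow\quad
  m_\eta - \bar y = (1 - f)(\bar y^* - \bar y), \\
\mathrm{var}(\bar y^* - \bar y) &= \sigma^2\Bigl\{\frac{1}{n} + \frac{1}{N - n}\Bigr\}
  = \frac{\sigma^2}{n(1 - f)}, \\
\mathrm{var}(m_\eta - \bar y) &= (1 - f)^2\,\frac{\sigma^2}{n(1 - f)}
  = \frac{(1 - f)\sigma^2}{n}.
\end{align*}
```

This has the form of (9.6) with $\sigma^2$ in place of $v_{\eta\eta}$.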

One difference between the two approaches is that in the model-based formulation 
the choice of statistics arises directly from considerations of sufficiency. 
In the design-based method the justification for the use of $\bar y$ is partly general 
plausibility and partly that only linear functions of the observed values have a 
randomization-based mean that involves $m_\eta$. The unweighted average $\bar y$ has the 
smallest variance among all such linear unbiased estimates of $m_\eta$. 

While the model-based approach is more in line with the discussion in the 
earlier part of this book, the relative freedom from what in many sampling 
applications might seem very contrived assumptions has meant that the design- 
based approach has been the more favoured in most discussions of sampling in 
the social field, but perhaps rather less so in the natural sciences. 
Particularly in more complicated problems, but also to aim for greater theoretical 
unity, it is natural to try to apply likelihood ideas to the design-based 
approach. It is, however, unclear how to do this. One approach is to regard 
$(\eta_1, \ldots, \eta_N)$ as the unknown parameter vector. The likelihood then has the 
following form. For those $s$ that are observed the likelihood is constant when the 
$\eta_s$ in question equals the corresponding $y$ and zero otherwise and the likelihood 
does not depend on the unobserved $\eta_s$. That is, the likelihood summarizes that 
the observations are what they are and that there is no information about the 
unobserved individuals. In a sense this is correct and inevitable. If there is no 
information whatever either about how the sample was chosen or about the 
structure of the population no secure conclusion can be drawn beyond the indi- 
viduals actually sampled; the sample might have been chosen in a highly biased 
way. Information about the unsampled items can come only from an assumption 
of population form or from specification of the sampling procedure. 


9.2.3 A development 


Real sampling problems have many complicating features which we do not 
address here. To illustrate further aspects of the interplay between design- and 
model-based analyses it is, however, useful to consider the following extension. 
Suppose that for each individual there is a further variable z and that these are 
known for all individuals. The finite population mean of $z$ is denoted by $m_z$ 
and is thus known. The information about z might indeed be used to set up 
a modified and more efficient sampling scheme, but we continue to consider 
random sampling without replacement. 

Suppose further that it is reasonable to expect approximate proportionality 
between the quantity of interest $\eta$ and $z$. Most commonly $z$ is some measure 
of size expected to influence the target variable roughly proportionally. After 
the sample is taken $\bar y$ and the sample mean of $z$, say $\bar z$, are known. If there were 
exact proportionality between $\eta$ and $z$ the finite population mean of $\eta$ would be 

$$\hat m_\eta = \bar y m_z/\bar z \tag{9.11}$$

and it is sensible to consider this as a possible estimate of the finite population 
mean $m_\eta$ in which any discrepancy between $\bar z$ and $m_z$ is used as a base for a 
proportional adjustment to $\bar y$. 
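
The estimate (9.11) is simple to compute. The sketch below uses a hypothetical population in which proportionality holds exactly, so that the adjusted estimate recovers the population mean even though the unadjusted sample mean does not; the function name `ratio_estimate` and all numerical values are illustrative assumptions.

```python
from statistics import fmean


def ratio_estimate(y, z, m_z):
    """Ratio estimate (9.11): sample mean of y scaled by m_z / (sample mean of z)."""
    z_bar = fmean(z)
    if z_bar == 0:
        raise ValueError("sample mean of z must be non-zero")
    return fmean(y) * m_z / z_bar


# Hypothetical population with exact proportionality eta = 2.5 * z.
z_pop = [4.0, 10.0, 6.0, 12.0, 8.0]
eta_pop = [2.5 * z for z in z_pop]

sample = [0, 1]                      # labels of the sampled individuals
y = [eta_pop[s] for s in sample]
z = [z_pop[s] for s in sample]

unadjusted = fmean(y)                              # plain sample mean
adjusted = ratio_estimate(y, z, fmean(z_pop))      # uses the known m_z
```

In real populations proportionality is only approximate, and the adjustment then reduces rather than removes the error.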

A simple model-based theory of this can be set out in outline as follows. 
Suppose that the individual values, $\eta_k$, and therefore if observed $y_k$, are random 
variables of the form 

$$\eta_k = \beta z_k + \zeta_k\sqrt{z_k}, \tag{9.12}$$

where $\zeta_1, \ldots, \zeta_N$ are independently normally distributed with zero mean and 
variance $\sigma_\zeta^2$. Conformity with this representation can to some extent be tested 
from the data. When $z_k$ is a measure of size and $\eta_k$ is an aggregated effect over 
the individual unit, the square root dependence and approximate normality of 
the error terms have some theoretical justification via a Central Limit like effect 
operating within individuals. 

Analysis of the corresponding sample values is now possible by the method 
of weighted least squares or more directly by ordinary least squares applied to 
the representation 

$$y_k/\sqrt{z_k} = \beta\sqrt{z_k} + \zeta_k, \tag{9.13}$$

leading to the estimate $\hat\beta = \bar y/\bar z$ and to estimates of $\sigma_\zeta^2$ and 
$\mathrm{var}(\hat\beta) = \sigma_\zeta^2/(n\bar z)$. Moreover, $\sigma_\zeta^2$ is estimated by 

$$s_\zeta^2 = \frac{\Sigma(y_k - \hat\beta z_k)^2/z_k}{n - 1}. \tag{9.14}$$


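The finite population mean of interest, $m_\eta$, can be written in the form 

$$f\bar y + (1 - f)\Sigma^*\eta_l/(N - n) = f\bar y + (1 - f)\{\beta\Sigma^* z_l + \Sigma^*\zeta_l\sqrt{z_l}\}/(N - n), \tag{9.15}$$

where $\Sigma^*$ denotes summation over the individuals not sampled. Because 

$$f\bar z + (1 - f)\Sigma^* z_l/(N - n) = m_z, \tag{9.16}$$

it follows, on replacing $\beta$ by $\hat\beta$, that the appropriate estimate of $m_\eta$ is 
$\hat m_\eta$ and that its variance is 

$$\frac{m_z(m_z - f\bar z)}{n\bar z}\sigma_\zeta^2. \tag{9.17}$$

This can be estimated by replacing $\sigma_\zeta^2$ by $s_\zeta^2$ and hence an exact pivot formed 
for estimation of $m_\eta$. 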

The design-based analysis is less simple. First, for the choice of $\hat m_\eta$ we rely 
either on the informal argument given when introducing the estimate above or 
on the use of an estimate that is optimal under the very special circumstances 
of the superpopulation model but whose properties are studied purely in a 
design-based formulation. 

For this, the previous discussion of sampling without replacement from a 
finite population shows that $(\bar y, \bar z)$ has mean $(m_\eta, m_z)$ and has covariance matrix, 
in a notation directly extending that used for $v_{\eta\eta}$, 

$$\begin{pmatrix} v_{\eta\eta} & v_{\eta z} \\ v_{z\eta} & v_{zz} \end{pmatrix}(1 - f)/n. \tag{9.18}$$
If $n$ is large and the population size $N$ is such that $0 < f < 1$, the design-based 
sampling distribution of the means is approximately bivariate normal and the 
conditional distribution of $\bar y$ given $\bar z$ has approximately the form 

$$m_\eta + (v_{\eta z}/v_{zz})(\bar z - m_z) + \zeta, \tag{9.19}$$

where $\zeta$ has zero mean and variance $v_{\eta\eta.z}(1 - f)/n$, and 
$v_{\eta\eta.z} = v_{\eta\eta} - v_{\eta z}^2/v_{zz}$ is the variance residual to the regression on $z$. 
We condition on $\bar z$ for the reasons discussed in Chapters 4 and 5. In principle 
$\bar z$ has an exactly known probability distribution although we have used only an 
asymptotic approximation to it. 

Now, on expanding to the first order, the initial estimate is 

$$\hat m_\eta = m_\eta + (\bar y - m_\eta) - m_\eta(\bar z - m_z)/m_z, \tag{9.20}$$

and, on using the known form of $\bar y$, this gives that 

$$\hat m_\eta = m_\eta + (\bar z - m_z)(v_{\eta z}/v_{zz} - m_\eta/m_z) + \zeta; \tag{9.21}$$

in the penultimate term we may replace $m_\eta$ by $\bar y$. 

Now if the superpopulation model holds, $(v_{\eta z}/v_{zz} - m_\eta/m_z) = O_p(1/\sqrt{n})$ 
and the whole term is $O_p(1/n)$ and can be ignored, but in general this is not the 
case and the adjusted estimate 

$$\frac{\bar y m_z}{\bar z} - (\bar z - m_z)\left(\frac{v_{\eta z}}{v_{zz}} - \frac{\bar y}{m_z}\right) \tag{9.22}$$

is appropriate with variance $v_{\eta\eta.z}(1 - f)/n$ as indicated. Note that in the unlikely 
event that $\eta$ and $z$ are unrelated this recovers $\bar y$. 

The model-based analysis hinges on the notion, sometimes rather contrived, 
that the finite population of interest is itself a random sample from a super- 
population. There are then needed strong structural assumptions about the 
superpopulation, assumptions which can to some extent be tested from the 
data. There results a clear method of estimation with associated distributional 
results. By contrast the design-based approach does not on its own specify the 
appropriate estimate and the associated standard errors and distributional theory 
are approximations for large sample sizes but assume much less. 


9.3 Design of experiments 


9.3.1 General remarks 


We now turn to the mathematically strongly related but conceptually very 
different issue of the role of randomization in the design of experiments, 
i.e., of investigations in which the investigator has in principle total control 
over the system under study. As a model of the simplest such situation we 
suppose that there are 2m experimental units and two treatments, T and C, to 
be compared; these can be thought of as a new treatment and a control. The 
investigator assigns one treatment to each experimental unit and measures a 
resulting response, y. 

We shall introduce and compare two possible designs. In the first, a so-called 
completely randomized design, m of the 2m units are selected at random without 
replacement to receive T and the remainder receive C. For the second design 
we assume available supplementary information about the units on the basis of 
which the units can be arranged in pairs. In some aspect related to the response, 
there should be less variation between units within a pair than between units in 
different pairs. Then within each pair one unit is selected at random to receive 
treatment T, the other receiving C. This forms a matched pair design, a special 
case of what for more than two treatments is called a randomized complete 
block design. 

We compare a model-based and a design-based analysis of these two designs. 


9.3.2 Model-based formulation 


A model-based formulation depends on the nature of the response variable, for 
example on whether it is continuous or discrete or binary and on the kind of 
distributional assumptions that are reasonable. One fairly general formulation 
could be based on an exponential family representation, but for simplicity we 
confine ourselves here largely to normal-theory models. 

In the absence of a pairing or blocking structure, using a completely randomized 
design, the simplest normal-theory model is to write the random 
variable corresponding to the observation on unit $s$ for $s = 1, \ldots, 2m$ in the 
symmetrical form 

$$Y_s = \mu + \psi d_s + \epsilon_s. \tag{9.23}$$

Here $d_s = 1$ if unit $s$ receives T and $d_s = -1$ if unit $s$ receives C and $\epsilon_s$ are 
independently normally distributed with zero mean and variance $\sigma_{CR}^2$. In this 
formulation the treatment effect is $2\psi = \Delta$, say, assumed constant, and $\sigma_{CR}^2$ 
incorporates all sources of random variation arising from variation between 
equivalent experimental units and from measurement error. 

The corresponding formulation for the matched pair design specifies for 
$k = 1, \ldots, m$; $t = T, C$ that the observation on unit $k, t$ is the observed value of 
the random variable 

$$Y_{kt} = \mu + \lambda_k + \psi d_{kt} + \epsilon_{kt}. \tag{9.24}$$

Here again the $\epsilon$ are independently normally distributed with zero mean but 
now in general with a different variance $\sigma_{MP}^2$. The general principle involved in 
writing down this and similar models for more complicated situations is that if 
some balance is inserted into the design with the objective of improving precision 
then this balance should be represented in the model, in the present case 
by including the possibly large number of nuisance parameters $\lambda_k$, representing 
variation between pairs. Note that either design might have been used and in 
notional comparisons of them the error variances are not likely to be the same. 
Indeed the intention behind the matched design is normally that $\sigma_{MP}^2$ is likely 
to be appreciably less than $\sigma_{CR}^2$. 

In general the analysis of the second model will need to recognize the relatively 
large number of nuisance parameters present, but for a linear least squares 
analysis this raises no difficulty. In both designs the difference between the two 
treatment means estimates $\Delta = 2\psi$ with variance $2/m$ times the appropriate 
variance. This variance is estimated respectively by the appropriate residual 
mean squares, namely by 

$$\{\Sigma_{s \in T}(Y_s - \bar Y_T)^2 + \Sigma_{s \in C}(Y_s - \bar Y_C)^2\}/(2m - 2) \tag{9.25}$$

and by 

$$\{\Sigma(Y_{kT} - Y_{kC})^2 - m(\bar Y_T - \bar Y_C)^2\}/(m - 1). \tag{9.26}$$

Here in both formulae $\bar Y_T$ and $\bar Y_C$ are the means of the observations on T and 
on C and in the first formula $s \in T$, for example, refers to a sum over units 
receiving T. 

It can be enlightening, especially in more complicated situations, to summar- 
ize such data in analysis of variance tables which highlight the structure of the 
data as well as showing how to estimate the residual variance. 

These arguments lead to pivots for inference about $\Delta$, the pivotal distributions 
being of the Student $t$ form with respectively $2m - 2$ and $m - 1$ degrees of 
freedom. 
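
A minimal sketch of the two pivots, evaluated at zero treatment effect, is given below. It uses the standard pooled two-sample and paired Student t statistics; the function names and the data are hypothetical, and in the paired version the standard error of the mean difference is computed directly from the sample variance of the within-pair differences.

```python
from statistics import fmean


def completely_randomized_t(y_t, y_c):
    """Pooled two-sample t statistic and its 2m - 2 degrees of freedom."""
    m = len(y_t)
    if len(y_c) != m:
        raise ValueError("both treatments need m observations")
    mean_t, mean_c = fmean(y_t), fmean(y_c)
    rss = (sum((y - mean_t) ** 2 for y in y_t)
           + sum((y - mean_c) ** 2 for y in y_c))
    s2 = rss / (2 * m - 2)                      # residual mean square
    return (mean_t - mean_c) / (2 * s2 / m) ** 0.5, 2 * m - 2


def matched_pair_t(y_t, y_c):
    """Paired t statistic and its m - 1 degrees of freedom."""
    m = len(y_t)
    if len(y_c) != m:
        raise ValueError("each pair needs one observation per treatment")
    d = [a - b for a, b in zip(y_t, y_c)]
    d_bar = fmean(d)
    var_d = sum((x - d_bar) ** 2 for x in d) / (m - 1)  # variance of a difference
    return d_bar / (var_d / m) ** 0.5, m - 1


# Hypothetical responses; position k in each list belongs to pair k.
y_t = [12.0, 15.0, 11.0, 14.0]
y_c = [10.0, 12.0, 10.0, 11.0]
t_cr, df_cr = completely_randomized_t(y_t, y_c)
t_mp, df_mp = matched_pair_t(y_t, y_c)
```

Only one of the two analyses is appropriate for a given set of data, namely the one matching the design actually used; both are applied to the same numbers here purely to show the arithmetic.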


9.3.3 Design-based formulation 


For a design-based discussion it is simplest to begin with no explicit block or 
pair structure and with a null hypothesis of no treatment effect. Consider what 
may be called the strong null hypothesis that the response observed on unit s 
is a constant, say, $\eta_s$, characteristic of the $s$th unit and is independent of the 
allocation of treatments to it and other units. That is, the treatments have no 
differential effect of any sort (in terms of the particular response variable y). 
Note that this hypothesis has no stochastic component. It implies the existence 
of two potential physical possibilities for each unit, one corresponding to each 
treatment, and that only one of these can be realized. Then it is hypothesized 
that the resulting values of y are identical. 

Note that if an independent and identically distributed random variable were 
added to each possible response, one for T and one for C, so that the dif- 
ference between T and C for a particular unit became a zero mean random 
variable identically distributed for each unit, the observable outcomes would be 
undetectably different from those under the original strong null hypothesis. That 
is, the nonstochastic character of the strong null hypothesis cannot be tested 
empirically or at least not without supplementary information, for example 
about the magnitude of unexplained variation to be anticipated. 

The analysis under this null hypothesis can be developed in two rather different 
ways. One is to note that $\bar Y_T$, say, is in the completely randomized design 
the mean of a random sample of size $m$ drawn randomly without replacement 
from the finite population of size $2m$ and that $\bar Y_C$ now has the role played by 
$\bar y^*$ in the sampling-theory discussion of the complementary sample. It follows 
as in the previous section that 

$$E_R(\bar Y_T - \bar Y_C) = 0, \qquad \mathrm{var}_R(\bar Y_T - \bar Y_C) = 2v_{\eta\eta}/m \tag{9.27}$$

and that $v_{\eta\eta}$ is estimated to a rather close approximation by the model-based 
estimate of variance. To see the form for the variance of the estimated effect, 
note that $(\bar Y_T + \bar Y_C)$ is twice the mean over all units and is a constant in the 
randomization. Thus the difference of means differs from $2\bar Y_T$ only by a constant 
and so has four times the variance of the sample mean $\bar Y_T$, itself the mean of a 
random sample corresponding to a sampling fraction of 1/2; that is, the variance is 
$4(1 - 1/2)v_{\eta\eta}/m = 2v_{\eta\eta}/m$. This argument leads to the same test statistic as 
in the model-based approach. Combined with an appeal to a form of the Central 
Limit Theorem, we have an asymptotic version of the model-based analysis of the 
completely randomized design. We shall see later that an exact distributional 
analysis is in principle possible. 
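
The moments in (9.27) can be verified for a small experiment by listing every allocation of T to m of the 2m units under the strong null hypothesis. The unit constants and the function name `difference_distribution` below are hypothetical, chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations


def difference_distribution(eta, m):
    """All values of (mean on T) - (mean on C) under the strong null hypothesis."""
    total = sum(eta)
    diffs = []
    for treated in combinations(range(len(eta)), m):
        sum_t = sum(eta[s] for s in treated)
        diffs.append(Fraction(sum_t, m) - Fraction(total - sum_t, m))
    return diffs


eta = [3, 7, 1, 4, 6, 9]   # hypothetical unit constants, 2m = 6
m = len(eta) // 2
diffs = difference_distribution(eta, m)

mean_r = sum(diffs) / len(diffs)
var_r = sum((d - mean_r) ** 2 for d in diffs) / len(diffs)

eta_bar = Fraction(sum(eta), len(eta))
v = sum((x - eta_bar) ** 2 for x in eta) / (len(eta) - 1)
# mean_r is 0 and var_r equals 2 * v / m, in agreement with (9.27).
```

The same list of differences is the exact randomization distribution of the estimated effect, so it could also be used directly for a test.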

For the matched pair design the analysis is the same in general approach but 
different in detail. The most direct formulation is to define $X_k$, a new variable 
for each pair, as the difference between the observation on T and that on C. 
Under the strong null hypothesis $X_k$ is equally likely to be $x_k$ or $-x_k$, where $x_k$ is 
the observed difference for the $k$th pair. It follows that under the randomization 
distribution $\bar X = \bar Y_T - \bar Y_C$ has zero mean and variance 

$$\Sigma x_k^2/m^2. \tag{9.28}$$

The test statistic is a monotonic function of that involved in the model-based 
discussion and the tests are asymptotically equivalent. 
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For comparison with the completely randomized design, note that the 
variance equivalent to that in the linear model is 


$$\Sigma x_k^2 / (2m) = \Sigma (\eta_{k1} - \eta_{k0})^2 / (2m), \tag{9.29}$$


as contrasted with what it would be if the completely randomized design were 
used on the same set of units 


$$\Sigma (\eta_{ks} - \bar{\eta}_{..})^2 / (2m - 1). \tag{9.30}$$


A rather different approach is the following. Choose a test statistic for com- 
paring the treatments. This could be, but does not have to be, Yr — Yc or the 
corresponding standardized test statistic used above. Now under the strong null 
hypothesis the value of the test statistic can be reconstructed for all possible 
treatment allocations by appropriate permutation of T and C. Then, because 
the design gives to all these arrangements the same probability, the exact ran- 
domization distribution can be found. This both allows the choice of other 
test statistics and also avoids the normal approximation involved in the initial 
discussion. 

Table 9.1 illustrates a very simple case of a matched pair experiment. Completion
of the enumeration shows that 7 of the 32 configurations give a total
summed difference of 8, the observed value, or more and, the distribution
being symmetrical about zero, another 7 show a deviation as great as or more
extreme than that observed but in the opposite direction. The one-sided p-value
is 7/32 = 0.22. A normal approximation based on the known mean and variance
and with a continuity correction of the observed difference from −8 to
−7 is $\Phi(-7/\sqrt{54}) = 0.17$. Note that because the support of the test statistic
is a subset of even integers, the continuity correction is 1 not the more
usual 1/2.


Table 9.1. Randomization distribution for a simple matched
pair experiment. First column gives the observed differences
between two treatments. Remaining columns are the
first few of the $2^5$ configurations obtained by changing
signs of differences. Under randomization null hypothesis
all $2^5$ possibilities have equal probability. Last row gives
column totals.

Observed
difference     Configurations
    6           6   −6    6   −6    6   −6
    3           3    3   −3   −3    3    3
   −2           2    2    2    2   −2   −2
    2           2    2    2    2    2    2
   −1           1    1    1    1    1    1

    8          14    2    8   −4   10   −2
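
The enumeration behind Table 9.1 is small enough to carry out directly. The sketch below is illustrative only; the function names are hypothetical and not from the text. It lists all 32 sign configurations of the observed differences, counts those with total at least the observed total, and evaluates the continuity-corrected normal approximation with variance 54 and correction 1.

```python
from itertools import product
from math import erf, sqrt


def randomization_totals(diffs):
    """Totals for every sign configuration of the matched-pair differences."""
    magnitudes = [abs(d) for d in diffs]
    return [sum(s * a for s, a in zip(signs, magnitudes))
            for signs in product((1, -1), repeat=len(magnitudes))]


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


observed = [6, 3, -2, 2, -1]                 # first column of Table 9.1
totals = randomization_totals(observed)
obs_total = sum(observed)                    # 8
n_extreme = sum(t >= obs_total for t in totals)
p_exact = n_extreme / len(totals)

variance = sum(d * d for d in observed)      # 54, the same for every configuration
# The totals are all even, so the continuity correction is 1 rather than 1/2.
p_normal = normal_cdf(-(obs_total - 1) / sqrt(variance))

print(n_extreme, len(totals), round(p_exact, 2), round(p_normal, 2))
```

This reproduces the counts quoted above: 7 of the 32 configurations are at least as large as the observed total, giving 0.22, against 0.17 from the normal approximation.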

A first point is that in the randomization distribution all possible data configurations
have the same value of $\Sigma x_k^2$ so that, for example, the Student t statistic
arising in the model-based approach is a monotone function of the sample mean
difference or total. We therefore for simplicity take the sample total $\Sigma x_k$ as the
test statistic rather than Student t and form its distribution by giving each possible
combination of signs in $\Sigma(\pm x_k)$ equal probability $2^{-m}$. The one-sided
p-value for testing the strong null hypothesis is thus the proportion of the $2^m$
configurations with sample totals as large or larger than that observed. Alternatively
and more directly, computer enumeration can be used to find the exact
significance level, or some form of sampling used to estimate the level with
sufficient precision. Improvements on the normal approximation used above
can sometimes be usefully found by noting that the test statistic is the sum of
independent nonidentically distributed random variables and has the cumulant
generating function


$$\log E_R(e^{p \Sigma X_k}) = \Sigma \log \cosh(p x_k) \tag{9.31}$$


from which higher-order cumulants can be found and corrections for 
nonnormality introduced. 
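
As a worked illustration of how the cumulants follow from (9.31), the expansion below uses only the standard series for log cosh. The numerical values are computed here from the differences in Table 9.1 and are not quoted from the text.

```latex
% Series expansion of each term of (9.31):
\log\cosh u = \frac{u^2}{2} - \frac{u^4}{12} + O(u^6),
\qquad\text{so}\qquad
\Sigma \log\cosh(p x_k) = \frac{p^2}{2}\,\Sigma x_k^2 - \frac{p^4}{12}\,\Sigma x_k^4 + O(p^6).

% Matching coefficients of p^r / r! gives the cumulants of the total:
\kappa_1 = \kappa_3 = 0, \qquad
\kappa_2 = \Sigma x_k^2, \qquad
\kappa_4 = -2\,\Sigma x_k^4 .

% For the differences 6, 3, -2, 2, -1 of Table 9.1:
\kappa_2 = 54, \qquad
\kappa_4 = -2(1296 + 81 + 16 + 16 + 1) = -2820, \qquad
\kappa_4/\kappa_2^2 = -2820/2916 \approx -0.97 .
```

The odd cumulants vanish by symmetry, so the leading correction for nonnormality comes from the fourth cumulant, which is negative: the randomization distribution is shorter-tailed than the normal.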

Such tests are called randomization tests. They are formally identical to 
but conceptually quite different from the permutation tests outlined briefly in 
Chapter 3. Here the basis of the procedure is the randomization used in alloc- 
ating the treatments; there is no assumption about the stochastic variability of 
the individual units. By contrast, the validity of the permutation tests hinges 
on the assumed independence and identical distributional form of the random 
variability. 

In the design-based test the most extreme significance level that could be 
achieved is $2^{-m}$ in a one-sided test and $2^{-m+1}$ in a two-sided test. Thus for
quite small m there is inbred caution in the design-based analysis that is not 
present in the normal-theory discussion. This caution makes qualitative sense 
in the absence of appreciable prior knowledge about distributional form. There 
is a further important proviso limiting the value of randomization in isolated 
very small investigations. We have implicitly assumed that no additional con- 
ditioning is needed to make the randomization distribution relevant to the data 
being analysed. In particular, the design will have taken account of relevant 
features recognized in advance and for reasonably large m with a large number
of arrangements among which to randomize it will often be reasonable to regard
these arrangements as essentially equivalent. For very small m, however, each 
arrangement may have distinctive features which may be particularly relevant 
to interpretation. The randomization distribution then loses much or all of its 
force. For example in a very small psychological experiment it might be sensible 
to use gender as a basis for pairing the individuals, disregarding age as likely 
to be largely irrelevant. If later it was found that the older subjects fell predom- 
inantly in, say, the control group the relevance of the randomization analysis 
would be suspect. The best procedure in such cases may often be to use the 
most appealing systematic arrangement. Randomization simply of the names 
of the treatments ensures some very limited protection against accusations of 
biased allocation. 


Example 9.1. Two-by-two contingency table. If the response variable is binary, 
with outcomes 1 (success) and 0 (failure), the randomization test takes a very 
simple form under the strong null hypothesis. That is, the total number of 1s,
say r, is considered fixed by hypothesis and the randomization used in design
ensures that the number of 1s in, say, sample 1 is the number of 1s in a sample of
size m drawn randomly without replacement from a finite population of r 1s and
n − r 0s. This number has a hypergeometric distribution, recovering Fisher’s
exact test as a pure randomization test. The restriction to samples of equal size 
is unnecessary here. 
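
A minimal sketch of the hypergeometric calculation follows. The function names and the numerical table are hypothetical, chosen only to illustrate the argument: n = 20 units, r = 10 successes in total, m = 10 units in sample 1, of which 8 are successes.

```python
from math import comb


def hypergeom_pmf(k, n, r, m):
    """P(k ones in a sample of size m drawn without replacement from r ones and n - r zeros)."""
    return comb(r, k) * comb(n - r, m - k) / comb(n, m)


def fisher_one_sided(k_obs, n, r, m):
    """Randomization p-value: probability of k_obs or more ones in sample 1."""
    k_max = min(r, m)
    return sum(hypergeom_pmf(k, n, r, m) for k in range(k_obs, k_max + 1))


# Hypothetical data: 8 of the 10 successes fall in sample 1.
p_value = fisher_one_sided(8, n=20, r=10, m=10)
print(round(p_value, 4))
```

Nothing in the calculation requires the two samples to be of equal size; m and n − m enter only through the binomial coefficients.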

In earlier discussions of the 2 x 2 table, marginal totals have been fixed either 
by design or by what was termed technical conditioning. Here the total number 
of 1s is fixed for a third reason, namely in the light of the strong null hypothesis 
that the outcome on each study individual is totally unaffected by the treatment 
allocation. 


For large m there is a broad agreement between the randomization ana- 
lysis and any model-based analysis for a given distributional form and for 
the choice of test statistic essentially equivalent to Student t. This is clear 
because the model-based distribution is in effect a mixture of a large number 
of randomization distributions. 

The above discussion has been solely in terms of the null hypothesis of 
no treatment effect. The simplest modification of the formalism to allow for 
nonzero effects, for continuous measurements, is to suppose that there is a 
constant $\Delta$ such that on experimental unit s we observe $\eta_s + \Delta$ if T is applied
and $\eta_s$ otherwise.

This is the special case for two treatments of unit-treatment additivity. That 
is, the response observed on a particular unit is the sum of a constant characteristic
of the unit and a constant characteristic of the treatment applied to
that unit and also is independent of the treatment allocation to other units.
It is again a deterministic assumption, testable only indirectly. Confidence
limits for $\Delta$ are obtained by using the relation between confidence intervals
and tests. Take an arbitrary value of $\Delta$, say $\Delta_0$, and subtract it from all the
observations on T. Test the new data for consistency with the strong null hypothesis
by using the randomization test. Repeat this in principle for all values of
$\Delta_0$ and take as confidence set all those values not rejected at the appropriate
level. 
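
The inversion of the randomization test can be sketched for the matched pair design, where subtracting a trial value from every observation on T amounts to subtracting it from every within-pair difference. This is an illustration under stated assumptions, not a definitive procedure: the function names, the grid of trial values and the level 0.125 are all choices made here, and a two-sided test based on the absolute total is assumed.

```python
from itertools import product


def sign_flip_pvalue(diffs):
    """Two-sided randomization p-value for matched-pair differences."""
    observed = abs(sum(diffs))
    n_configs = 2 ** len(diffs)
    extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total = sum(s * d for s, d in zip(signs, diffs))
        if abs(total) >= observed - 1e-12:   # tolerance guards against rounding
            extreme += 1
    return extreme / n_configs


def confidence_set(diffs, grid, alpha):
    """Trial effects not rejected at level alpha after shifting the differences."""
    return [delta0 for delta0 in grid
            if sign_flip_pvalue([x - delta0 for x in diffs]) > alpha]


diffs = [6, 3, -2, 2, -1]                    # differences from Table 9.1
grid = [i / 2 for i in range(-20, 21)]       # trial values from -10 to 10
accepted = confidence_set(diffs, grid, alpha=0.125)
print(min(accepted), max(accepted))
```

With only five pairs the smallest attainable two-sided level is 2/32, so at the conventional 5 per cent level no trial value could be rejected and the confidence set would be the whole line; the level 0.125 is used here purely so that the example produces a bounded set.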

Adjustment of estimated treatment effects for concomitant variables, in prin- 
ciple measured before randomization to ensure that they are not affected by 
treatment effects, is achieved in a model-based approach typically by some 
appropriate linear model. A design-based analysis is usually approximate 
broadly along the lines sketched in Section 9.2 for sampling. 

One of the primary advantages of the design-based approach is that the same 
formulation of unit-treatment additivity leads for a broad class of randomiza- 
tion patterns, i.e., designs, to an unbiased estimate of the treatment differences 
and to an unbiased estimate of the effective error variance involved. This con- 
trasts with the model-based approach which appears to make a special ad hoc 
assumption for each design. For example for a Latin square design a special 
so-called main effect model that appears to be a highly specific assumption is 
postulated. This leads to the estimation of error via a particular residual sum 
of squares. In fact the same residual sum of squares arises in the design-based 
analysis directly from the specific randomization used and without any special 
assumption. 

There is the following further conceptual difference between the two 
approaches. A model-based approach might be regarded as introducing an 
element of extrapolation in that the conclusions apply to some ensemble of 
real or hypothetical repetitions that are involved in defining the parameters. 
The design-based approach, by contrast, is concerned with elucidating what 
happened in the particular circumstances of the investigation being analysed; 
how likely is it that the adverse play of chance in the way the treatments were 
allocated in the study has distorted the conclusions seriously? Any question of 
extrapolation is then one of general scientific principle and method and not a 
specifically statistical issue. 

This discussion of randomization has been from a frequentist standpoint. 
While the role of randomization in a formal Bayesian theory is not so clear, 
from one point of view it can be regarded as an informal justification of 
the independence assumptions built into typical forms of likelihood used in 
Bayesian inference. This is essentially parallel to the interplay between design- 
and model-based analyses in frequentist theory. In the more personalistic
approaches it might be argued that what is required is that You believe the
allocation to be essentially random but empirical experience suggests that, 
in some contexts at least, this is a very hazardous argument! 


Notes 9 


Section 9.1. Randomization has three roles in applications: as a device for 
eliminating biases, for example from unobserved explanatory variables and 
selection effects; as a basis for estimating standard errors; and as a foundation 
for formally exact significance tests. The relative importance of these varies 
greatly between fields. The emphasis in the present discussion is on the con- 
ceptual possibility of statistical inference not based on a probabilistic model of 
the underlying data-generating process. It is important to distinguish permuta- 
tion tests from the numerically equivalent randomization tests. The former are 
based on some symmetries induced by the probabilistic model, whereas the 
latter are a by-product of the procedure used in design. Modern computational 
developments make the use of permutation tests feasible even in quite com- 
plicated problems (Manly, 1997). They provide protection of significance tests 
against departures of distributional form but not against the sometimes more 
treacherous failure of independence assumptions. 


Section 9.2. General accounts of sampling theory are given by Cochran (1977) 
and Thompson (1992) and a more mathematical discussion by Thompson 
(1997). There has been extensive discussion of the relative merits of model- 
and design-based approaches. For the applications to stereology, including 
a discussion of model- versus design-based formulations, see Baddeley and 
Jensen (2005). For a comparison of methods of estimating the variance of ratio 
estimates, see Sundberg (1994). 


Section 9.3. The basic principles of statistical design of experiments were 
set out by Fisher (1935a) and developed, especially in the context of agricul- 
tural field trials, by Yates (1937). For an account of the theory emphasizing 
the generality of the applications, see Cox and Reid (2000). Fisher intro- 
duced randomization and randomization tests but his attitude to the use of 
the tests in applications is unclear; Yates emphasized the importance of ran- 
domization in determining the appropriate estimate of error in complex designs 
but dismissed the detailed randomization analysis as inferior to least squares. 
Kempthorne (1952) took the opposing view, regarding least squares analyses as 


Notes 9 193 


computationally convenient approximations to the randomization analyses. The 
development from the counterfactual model of experimental design to address 
studies of causation in observational studies has been extensively studied; see, 
in particular, Rubin (1978, 1984). By adopting a Bayesian formulation and 
assigning a prior distribution to the constants characterizing the observational 
units, the analysis is brought within the standard Bayesian specification. 


Appendix A 
A brief history 


A very thorough account of the history of the more mathematical side of statistics
up to the 1930s is given by Hald (1990, 1998, 2006). Stigler (1990) gives
a broader perspective and Heyde and Seneta (2001) have edited a series of 
vignettes of prominent statisticians born before 1900. 

Many of the great eighteenth and early nineteenth century mathematicians 
had some interest in statistics, often in connection with the analysis of astro- 
nomical data. Laplace (1749-1827) made extensive use of flat priors and 
what was then called the method of inverse probability, now usually called 
Bayesian methods. Gauss (1777-1855) used both this and essentially fre- 
quentist ideas, in particular in his development of least squares methods of 
estimation. Flat priors were strongly criticized by the Irish algebraist Boole 
(1815-1864) and by later Victorian mathematicians and these criticisms were 
repeated by Todhunter (1865) in his influential history of probability. Karl 
Pearson (1857-1936) began as, among other things, an expert in the theory
of elasticity, and brought Todhunter’s history of that theory to posthumous 
publication (Todhunter, 1886, 1893). 

In one sense the modern era of statistics started with Pearson’s (1900) 
development of the chi-squared goodness of fit test. He assessed this without 
comment by calculating and tabulating the tail area of the distribution. Pearson 
had some interest in Bayesian ideas but seems to have regarded prior distribu- 
tions as essentially frequency distributions. Pearson’s influence was dominant in 
the period up to 1914, his astonishing energy being shown in a very wide range 
of applications of statistical methods, as well as by theoretical contributions. 

While he continued to be active until his death, Pearson’s influence waned 
and it was R. A. Fisher (1890-1962) who, in a number of papers of the highest 
originality (Fisher, 1922, 1925a, 1934, 1935b), laid the foundations for much 
of the subject as now known, certainly for most of the ideas stressed in the 
present book. He distinguished problems of specification, of estimation and
of distribution. He introduced notions of likelihood, maximum likelihood and
conditional inference. He also had a distinctive view of probability. In parallel 
with his theoretical work he suggested major new methods of analysis and 
of experimental design, set out in two books of high impact (Fisher, 1925b, 
1935a). He was firmly anti-Bayesian in the absence of specific reasons for 
adopting a prior distribution (Fisher, 1935a, 1956). In addition to his central 
role in statistical thinking, he was an outstanding geneticist, in particular being
a pioneer in putting the theory of natural selection on a quantitative footing. 

Fisher had little sympathy for what he regarded as the pedanticism of precise 
mathematical formulation and, only partly for that reason, his papers are not 
always easy to understand. In the mid 1920s J. Neyman (1894-1981), then 
in Warsaw, and E. S. Pearson (1895-1980) began an influential collaboration
initially designed primarily, it would seem, to clarify Fisher’s writing. This led to 
their theory of testing hypotheses and to Neyman’s development of confidence 
intervals, aiming to clarify Fisher’s idea of fiducial intervals. As late as 1932 
Fisher was writing to Neyman encouragingly about this work, but relations 
soured, notably when Fisher greatly disapproved of a paper of Neyman’s on 
experimental design and no doubt partly because their being in the same building 
at University College London brought them too close to one another! Neyman 
went to Berkeley at the start of World War II in 1939 and had a very major 
influence on US statistics. The differences between Fisher and Neyman were 
real, centring, for example, on the nature of probability and the role, if any, 
for conditioning, but, I think, were not nearly as great as the asperity of the 
arguments between them might suggest. 

Although some of E. S. Pearson’s first work had been on a frequency-based 
view of Bayesian inference, Neyman and Pearson, except for an early eleg- 
ant paper (Neyman and Pearson, 1933), showed little interest in Bayesian 
approaches. Wald (1902-1950) formulated a view of statistical analysis and
theory that was strongly decision-oriented; he assumed the availability of a 
utility or loss function but not of a prior distribution and this approach too 
attracted much attention for a while, primarily in the USA. 

Fisher in some of his writing emphasized direct use of the likelihood as a 
measure of the relation between data and a model. This theme was developed by 
Edwards (1992) and by Royall (1997) and in much work, not always published, 
by G. A. Barnard (1915-2002); see, in particular, Barnard et al. (1962). 

As explained above, twentieth century interest in what was earlier called 
the inverse probability approach to inference was not strong until about 1950. 
The economist John Maynard Keynes (1883-1946) had written a thesis on the
objectivist view of these issues but the main contribution to that approach, and 
a highly influential one, was that of the geophysicist and applied mathematician
Harold Jeffreys (1891-1989). He wrote on the objective degree of belief view
of probability in the early 1930s but his work is best approached from the 
very influential Theory of Probability, whose first edition was published in 
1939. He justified improper priors, gave rules for assigning priors under some 
circumstances and developed his ideas to the point where application could 
be made. Comparable developments in the philosophy of science, notably by 
Carnap, do not seem to have approached the point of applicability. The most 
notable recent work on standardized priors as a base for comparative analyses 
is that of Bernardo (1979, 2005); this aims to find priors which maximize the 
contribution of the data to the posterior. 

At the time of these earlier developments the philosopher F. P. Ramsey (1903-1930)
developed a theory of personalistic probability strongly tied to personal
decision-making, i.e., linking probability and utility intimately. Ramsey died 
before developing these and other foundational ideas. The systematic develop- 
ment came independently from the work of de Finetti (1906-1985) and Savage 
(1917-1971), both of whom linked personalistic probability with decision-making.
The pioneering work of I. J. Good (1916-) was also important. There
were later contributions by many others, notably de Groot (1931-1989) and
Lindley (1923-). There followed a period of claims that the arguments for this
personalistic theory were so persuasive that anything to any extent inconsistent 
with that theory should be discarded. For the last 15 years or so, i.e., since 
about 1990, interest has focused instead on applications, especially encouraged 
by the availability of software for Markov chain Monte Carlo calculations, in 
particular on models of broadly hierarchical type. Many, but not all, of these 
applications make no essential use of the more controversial ideas on person- 
alistic probability and many can be regarded as having at least approximately 
a frequentist justification. 

The later parts of this brief history may seem disturbingly Anglo-centric. 
There have been, of course, many individuals making important original 
contributions in other languages but their impact has been largely isolated.


A personal view


Much of this book has involved an interplay between broadly frequentist dis- 
cussion and a Bayesian approach, the latter usually involving a wider notion 
of the idea of probability. In many, but by no means all, situations numeric- 
ally similar answers can be obtained from the two routes. Both approaches 
occur so widely in the current literature that it is important to appreciate the 
relation between them and for that reason the book has attempted a relatively 
dispassionate assessment. 

This appendix is, by contrast, a more personal statement. Whatever the 
approach to formal inference, formalization of the research question as being 
concerned with aspects of a specified kind of probability model is clearly of 
critical importance. It translates a subject-matter question into a formal statist- 
ical question and that translation must be reasonably faithful and, as far as is 
feasible, the consistency of the model with the data must be checked. How this 
translation from subject-matter problem to statistical model is done is often the 
most critical part of an analysis. Furthermore, all formal representations of the 
process of analysis and its justification are at best idealized models of an often 
complex chain of argument. 

Frequentist analyses are based on a simple and powerful unifying principle. 
The implications of data are examined using measuring techniques such as 
confidence limits and significance tests calibrated, as are other measuring instru- 
ments, indirectly by the hypothetical consequences of their repeated use. In 
particular, they use the notion that consistency in a certain respect of the data 
with a specified value of the parameter of interest can be assessed via a p-value. 
This leads, in particular, to use of sets of confidence intervals, often conveni- 
ently expressed in pivotal form, constituting all parameter values consistent with 
the data at specified levels. In some contexts the emphasis is rather on those 
possible explanations that can reasonably be regarded as refuted by the data. 

The well-known definition of a statistician as someone whose aim in life 
is to be wrong in exactly 5 per cent of everything they do misses its target.
The objective is to recognize explicitly the possibility of error and to use that
recognition to calibrate significance tests and confidence intervals as an aid 
to interpretation. This is to provide a link with the real underlying system, as 
represented by the probabilistic model of the data-generating process. This is 
the role of such methods in analysis and as an aid to interpretation. The more 
formalized use of such methods, using preset levels of significance, is also 
important, for example in connection with the operation of regulatory agencies. 

In principle, the information in the data is split into two parts, one to assess the 
unknown parameters of interest and the other for model criticism. Difficulties 
with the approach are partly technical in evaluating p-values in complicated 
systems and also lie in ensuring that the hypothetical long run used in calibra- 
tion is relevant to the specific data under analysis, often taking due account of 
how the data were obtained. The choice of the appropriate set of hypothetical 
repetitions is in principle fundamental, although in practice much less often a 
focus of immediate concern. The approach explicitly recognizes that drawing 
conclusions is error-prone. Combination with qualitative aspects and external 
knowledge is left to a qualitative synthesis. The approach has many advantages, 
not least in its ability, especially in the formulation of Neyman and
Pearson, to provide a basis for assessing methods of analysis put forward on 
informal grounds and for comparing alternative proposals for data collection 
and analysis. Further, it seems clear that any proposed method of analysis that 
in repeated application would mostly give misleading answers is fatally flawed. 

Some versions of the Bayesian approach address the same issues of inter- 
pretation by a direct use of probability as an impersonal measure of uncertainty, 
largely ignoring other sources of information by the use of priors that are in 
some sense standardized or serving as reference points. While there is inter- 
esting work on how such reference priors may be chosen, it is hard to see a 
conceptual justification for them other than that, at least in low-dimensional 
problems, they lead, at least approximately, to procedures with acceptable fre- 
quentist properties. In particular, relatively flat priors should be used as such 
with caution, if at all, and regarded primarily as a possible route to reason- 
able confidence limits. Flat priors in multidimensional parameters may lead to 
absurd answers. From this general perspective one view of Bayesian proced- 
ures is that, formulated carefully, they may provide a convenient algorithm for 
producing procedures that may have very good frequentist properties. 

A conceptually entirely different Bayesian approach tackles the more ambi- 
tious task of including sources of information additional to the data by doing 
this not qualitatively but rather quantitatively by a specially chosen prior distri- 
bution. This is, of course, sometimes appealing. In particular, the interpretation 
of data often involves serious sources of uncertainty, such as systematic errors, 
other than those directly represented in the statistical model used to represent
the data-generating process, as well as uncertainties about the model itself. In a
more positive spirit, there may be sources of information additional to the data 
under analysis which it would be constructive to include. The central notion, 
used especially in the task of embracing all sorts of uncertainty, is that prob- 
ability assessments should be coherent, i.e., internally consistent. This leads 
to the uncertainty attached to any unsure event or proposition being measured 
by a single real number, obeying the laws of probability theory. Any such use 
of probability as representing a notion of degree of belief is always heavily 
conditional on specification of the problem. 

Major difficulties with this are first that any particular numerical probability, 
say 1/2, is given exactly the same meaning whether it is Your current assessment 
based on virtually no information or it is solidly based on data and theory. 
This confusion is unacceptable for many interpretative purposes. It raises the 
rhetorical question: is it really sensible to suppose that uncertainty in all its 
aspects can always be captured in a single real number? 

Next, the issue of temporal coherency discussed in Section 5.10 is often 
largely ignored; obtaining the data, or preliminary analysis, may entirely prop- 
erly lead to reassessment of the prior information, undermining many of the 
conclusions emanating from a direct application of Bayes’ theorem. Direct 
uses of Bayes’ theorem are in effect the combination of different sources of 
information without checking for mutual consistency. Issues of model criti- 
cism, especially a search for ill-specified departures from the initial model, are 
somewhat less easily addressed within the Bayesian formulation. 

Finally, perhaps more controversially, the underlying definition of probability 
in terms of Your hypothetical behaviour in uncertain situations, subject only to 
internal consistency, is in conflict with the primacy of considering evidence from 
the real world. That is, although internal consistency is desirable, to regard it 
as overwhelmingly predominant is in principle to accept a situation of always 
being self-consistently wrong as preferable to some inconsistent procedure that 
is sometimes, or even quite often, right. 

This is not a rejection of the personalistic Bayesian approach as such; that 
may, subject to suitable sensitivity analysis, sometimes provide a fruitful way of 
injecting further information into an analysis or using data in a more speculative 
way, especially, but not only, as a basis for personal decision-making. Rather it
is an emphatic rejection of the notion that the axioms of personalistic probability 
are so compelling that methods not explicitly or implicitly using that approach 
are to be rejected. Some assurance of being somewhere near the truth takes 
precedence over internal consistency. 

One possible role for a prior is the inclusion of expert opinion. This opinion 
may essentially amount to a large number of empirical data informally analysed. 
There are many situations where it would be unwise to ignore such information,
although there may be some danger of confusing well-based information with
the settling of issues by appeal to authority; also empirical experience may 
suggest that expert opinion that is not reasonably firmly evidence-based may 
be forcibly expressed but is in fact fragile. The frequentist approach does not 
ignore such evidence but separates it from the immediate analysis of the specific 
data under consideration. 

The above remarks address the use of statistical methods for the analysis and 
interpretation of data and for the summarization of evidence for future use. For 
automatized decision-making or for the systemization of private opinion, the 
position is rather different. Subject to checks of the quality of the evidence- 
base, use of a prior distribution for relevant features may well be the best route 
for ensuring that valuable information is used in a decision procedure. If that 
information is a statistical frequency the Bayesian procedure is uncontroversial, 
subject to the mutual consistency of the information involved. It is significant 
that the major theoretical accounts of personalistic probability all emphasize 
its connection with decision-making. An assessment of the evidence-base for 
the prior remains important. 

A further type of application often called Bayesian involves risk assessment, 
usually the calculation of the probability of an unlikely event arising from a 
concatenation of circumstances, themselves with poorly evaluated probabilities. 
Such calculations can certainly be valuable, for example in isolating critical 
pathways to disaster. They are probably best regarded, in most cases, as 
frequentist probability calculations using poorly known elements and as such 
demand careful examination of dependence both on the values of marginal 
probabilities and on the strong, if not always explicit, independence 
assumptions that often underlie such risk assessments. There are standard 
techniques of experimental design to help with the sensitivity analyses that 
are desirable in such work. 
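
One way such a sensitivity analysis might be organized is as a two-level factorial design over the poorly known inputs. The sketch below is hypothetical throughout: the three component probabilities, their low and high levels, and the simple common-cause term used to represent a departure from independence are all assumptions made for illustration.

```python
import itertools
import math

# Hypothetical low/high levels for each poorly known input (assumptions).
levels = {
    "p_a": (1e-3, 1e-2),
    "p_b": (1e-3, 1e-2),
    "p_c": (1e-2, 1e-1),
    "q_common": (0.0, 1e-5),  # probability that a common cause fails all three
}


def top_event_probability(p_a, p_b, p_c, q_common):
    """P(all three fail): common cause, or else independent failures."""
    independent = p_a * p_b * p_c
    return q_common + (1.0 - q_common) * independent


names = list(levels)
runs = []
# Full 2^4 factorial: every combination of low (0) and high (1) levels.
for signs in itertools.product((0, 1), repeat=len(names)):
    settings = {n: levels[n][s] for n, s in zip(names, signs)}
    response = math.log10(top_event_probability(**settings))
    runs.append((signs, response))


def main_effect(index):
    """Mean response at the high level minus mean response at the low level."""
    high = [r for s, r in runs if s[index] == 1]
    low = [r for s, r in runs if s[index] == 0]
    return sum(high) / len(high) - sum(low) / len(low)


effects = {n: main_effect(i) for i, n in enumerate(names)}
for name, effect in effects.items():
    print(f"{name}: main effect on log10 probability = {effect:+.2f}")
```

Working on the log scale makes the effects of the marginal probabilities comparable with that of the dependence term. With many inputs a fractional rather than a full factorial would normally be used.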

The whole of this appendix, and indeed the whole book, is concerned with 
statistical inference. The object is to provide ideas and methods for the critical 
analysis and, as far as feasible, the interpretation of empirical data arising from 
a single experimental or observational study or from a collection of broadly 
similar studies bearing on a common target. The extremely challenging issues 
of scientific inference may be regarded as those of synthesising very different 
kinds of conclusions if possible into a coherent whole or theory and of placing 
specific analyses and conclusions within that framework. This process is surely 
subject to uncertainties beyond those of the component pieces of information 
and, like statistical inference, has the features of demanding stringent evaluation 
of the consistency of information with proposed explanations. The use, if any, 
in this process of simple quantitative notions of probability and their numerical 
assessment is unclear and certainly outside the scope of the present discussion. 
escape from likelihood pathology, 134 
estimating equation, see unbiased 
estimating equation 
estimating vector, 157 
estimation of variance, critique of 
standard account of, 168 
Euler’s constant, 113 
exchange paradox, 67, 93 
expectation, properties of, 14, 15 
expert opinion, value and dangers of, 
82, 93 
explanatory variable, conditioning on, 
1, 2 
exponential distribution, 5, 19, 22 
displaced, 137 
exponential family, 20, 23, 28, 96 
Bayesian inference in, 85 
canonical parameter constrained, 122 
canonical parameter in, 21 
canonical statistic in, 21 
choice of priors in, 23 
curved, 22, 23, 121 
frequentist inference for, 50, 51 
incomplete data from, 132 
information for, 98 
mean parameter, 21 
mixed parameterization of, 112 
exponential regression, 75 


factorial experiment, 145 
failure data, see survival data 
false discovery rate, 94 
fiducial distribution, 66, 93, 94 
inconsistency of, 67 
Fieller’s problem, 44 
finite Fourier transform, 138 
finite population correction, 180 
finite population, sampling of, 179, 184 
Fisher and Yates scores, 40 
Fisher’s exact test, see hypergeometric 
distribution, two-by-two 
contingency table 
Fisher’s hyperbola, 22 
Fisher information, see information 
Fisher’s identity, for modified likelihood, 
147 
Fisherian reduction, 24, 47, 68, 69, 89 
asymptotic theory in, 105 
frequentist inference 
conditioning in, 48 
definition of, 8 
formulation of, 24, 25, 50, 59, 68 
distribution, 54 
generalized method of moments, 173, 177 
Gothenburg, rain in, 70, 93 
gradient operator, 21, 28 
grid search, 125 
group of transformations, 6, 58 
hidden periodicity, 138 
higher-order asymptotics, see asymptotic 
theory 
hypergeometric distribution, 180, 190 
generalized, 53 
ignorance, 73 
disallowed in personalistic theory, 72 
impersonal degree of belief, 73, 77 
importance sampling, 128 
improper prior, 67 
inefficient estimates, study of, 110 
information 
expected, 97, 119 
in an experiment, 94 
observed, 102 
observed preferred to expected, 132 
information matrix 
expected, 107, 165 
partitioning of, 109, 110 
singular, 139 
transformation of, 108 
informative nonresponse, 140, 159 
innovation, 13 
interval estimate, see confidence interval, 
posterior distribution 
invariance, 6, 117 
inverse gamma distribution as prior, 
inverse Gaussian distribution, 90 
inverse probability, see Bayesian 
inference 
Jeffreys prior, 99 
Jensen’s inequality, 100 


Kullback-Leibler divergence, 141 


Lagrange multiplier, 122, 131 
Laplace expansion, 115, 127, 129, 168 
Laplace transform, see moment 
generating function 
least squares, 10, 15, 44, 55, 95, 110, 157 
likelihood, 17, 24, 27 
conditions for anomalous form, 134 
exceptional interpretation of, 58 
higher derivatives of, 120 
irregular, 135 
law of, 28 
local maxima of, 101 
marginal, 149, 160 
multimodal, 133, 135 
multiple maxima, 101 
partial, 150, 159 
profile, 111, 112 
sequential stopping for, 88 
unbounded, 134 
see also modified likelihood 
likelihood principle, 47, 62 
likelihood ratio, 91 
signed, 104 
sufficient statistic as, 91 
likelihood ratio test 
nonnested problems for, 115 
see also fundamental lemma, profile 
log likelihood 
linear covariance structure, 122, 132 
linear logistic model, 171 
linear model, 4, 19, 20, 55, 145, 148 
linear regression, 4 
linear sufficiency, 157 
location parameter, 5, 48, 57, 73, 98, 129 
log normal distribution, 74 
logistic regression, 140 
Mantel-Haenszel procedure, 54, 63 
Markov dependence graph, 123 
Markov property, 12, 119, 152 
Markov chain Monte Carlo (MCMC), 
128, 132 
martingale, 159 
matched pairs, 145, 146, 185 
nonnormal, 146 
maximum likelihood estimate 
asymptotic normality of, 102, 108 
definition of, 100 
exponential family, 22 
Laplace density for, 137 
properties of, 102 
mean parameter, see exponential family 
metric, 29 
missing completely at random, 140 
missing information, 127 
mixture of distributions, 144 
model 
base of inference, 178 
choice of, 114, 117 
covering, 121 
failure of, 141, 142 
nature of, 185 
primary and secondary features of, 2 
saturated, 120 
separate families of, 114 
uncertainty, 84 
model criticism, 3, 7, 37, 58, 90 
Poisson model for, 33 
sufficient statistic used for, 19, 33 
modified likelihood, 144, 158, 159 
directly realizable, 149 
factorization based on, 149 
marginal, 75 
need for, 144 
partial, 159 
pseudo, 152 
requirements for, 147, 148 
moment generating function, 15, 21 
multinomial distribution, 33, 53, 63 
multiple testing, 86, 88, 94 
Neyman factorization theorem, see 
Neyman-Pearson theory, 25, 29, 33, 36, 
43, 63, 68, 163, 176 
asymptotic theory in, 106 
classification problem for, 92 
fundamental lemma in, 92 
optimality in, 68 
suboptimality of, 69 
noncentral chi-squared, paradox with, 74 
non-likelihood-based methods, 175 
see also nonparametric test 
non-Markov model, 144 
nonlinear regression, 4, 10, 16, 22, 139 
nonparametric model, 2 
nonparametric test, 37 
normal means, 3, 11, 32, 46, 56, 59, 165 
Bayesian analysis for, 9, 60, 73, 80 
consistency of data and prior for, 85 
information for, 98 
integer parameter space for, 143 
ratio of, 40 
notorious example, 63, 68 
related to regression analysis, 69 
nuisance parameter, see parameters 
null hypothesis, 30 
see significance test 
numerical analysis, 125, 132 


objectives of inference, 7 
observed information, see information 
odds ratio, 5 
one-sided test, 
optional stopping, see sequential stopping 
orbit, 58 
order of magnitude notation, 95, 131 
order statistics, 20, 40 
nonparametric sufficiency of, 38 
orthogonal projection, 157 
orthogonality of parameters 
balanced designs, in, 112 


p-value, see significance test 
parameter space 
dimensionality of, 144 
nonstandard, 142, 144 
variation independence, 2 
parameters 
criteria for, 13, 14, 112 
of interest, 2 
orthogonality of, 112, 114 
superabundance of, 145-147 
transformation of, 98, 99, 102, 108, 
131 
vector of interest, 27 
parametric model, 2 
see also nonparametric model, 
semiparametric model 
partial likelihood, 150, 152 
periodogram, 187, 189 
permutation test, 38, 138 
personalistic probability, 79, 81 
upper and lower limits for, 93 
Pitman efficiency, see asymptotic relative 
efficiency 
personalistic probability, see Bayesian 
inference 
pivot, 25, 27, 29, 175 
asymptotic theory, role in, 109 
irregular problems, for, 136 
sampling theory, in, 181 
plug-in estimate for prediction, 161 
plug-in formula, 161, 162 
point estimation, 15, 165-169 
Poisson distribution, 32, 34, 55, 63, 99, 
147 
multiplicative model for, 54, 63 
overdispersion in, 158 
Poisson process, 90 
observed with noise, 41, 124 
posterior distribution, 5, 9 
power law contact, 136 
prediction, 84, 161, 175 
predictive distribution, 161 
predictive likelihood, 175 
primary feature, see model 
prior closed under sampling, see 
conjugate prior 
prior distribution, 9 
consistency with data of, 77, 85 
flat, 73 
improper, 46 
matching, 129, 130 
normal variance, for, 59 
reference, 76, 77, 83, 94 
retrospective, 88 
see also Bayesian inference 
probability 
axioms of, 65 
interpretations of, 7, 65, 70 
personalistic, 71, 72 
range of applicability of, 66 
profile log likelihood, 111, 119 
projection, 10 
proportional hazards model, 151 
protocol, see significance test 
pseudo-score, 152 
pseudo-likelihood, 152, 160 
case-control study for, 154 
pitfall with, 154 
time series for, 153 
pure birth process, see binary fission 


quadratic statistics, 109 
quasi likelihood, 157, 158, 160 
random effects, 146 
randomization, motivation of, 192 
randomization test, 188, 189, 191 
randomized block design, 185 
rank tests, 39 
recognizable subset, 71, 93 
rectangular distribution, see uniform 
distribution 
expansion, 132 
sampling theory, 179-184, 192 
score, 97 
score test, 104 
selective reporting, 86, 87 
self-denial, need for, 81 
semi-asymptotic, 124 
semiparametric model, 2, 4, 26, 151, 160 
sensitivity analysis, 66, 82, 175 
separate families, 142, 159 
likelihood ratio for, 142 
sequential stopping, 88, 90, 94 
Shannon information, 94 
Sheppard’s formula, 154 
sign test, 166 
asymptotic efficiency of, 167 
significance test, 30 
interpretation of, 41, 42 
singular information matrix, 139, 141 
smoothing, 4 
Stein’s paradox, 74 
complete, 50 
factorization theorem for, 18 
support of distribution, definition of, 14 
survival function, 113, 151 
tangent plane, 12, 158 
temporal coherency, see coherency 
transformation model, 57, 59, 63 